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We  propose  a  new  family  of  market-based  distributed  planning  algorithms  for  collaborative  multi-agent 
systems  with  complex  shared  constraints.  Such  constraints  tightly  couple  the  agents  together,  and  appear 
in  problems  ranging  from  task  or  resource  allocation  to  collision  avoidance.  While  it  is  not  immediately 
obvious,  a  wide  variety  of  constraints  can  in  fact  be  reduced  to  generalized  resource  allocation  problems; 
the  translation  is  straightforward  for  task  or  resource  allocation  problems,  and  for  general  problems  any 
shared  constraint  between  agents  can  be  considered  a  resource.  For  example,  satisfying  collision-avoidance 
constraints  is  equivalent  to  assigning  to  a  limited  number  of  agents  the  right  to  claim  a  particular  space  at 
a  particular  time.  Market-based  algorithms  have  become  popular  in  collaborative  multi-agent  planning 
particularly  for  task  allocation,  due  to  their  intuitive  and  simple  distributed  paradigm  as  well  as  their 
success  in  domains  such  as  robotics  and  software  agent  systems.  However,  they  suffer  from  several  main 
drawbacks:  1.  it  is  somewhat  of  an  art  to  create  a  reasonable  pricing  in  each  domain,  requiring  a  human 
designer  and  parameter  tuning,  2.  they  rarely  guarantee  optimality  3.  they  do  not  often  have  a  natural  way 
to  incorporate  uncertainty  in  planning,  and  4.  most  existing  algorithms  require  a  central  trusted 
auctioneer.  This  thesis  presents  a  solution  to  address  the  drawbacks  by  providing  mechanisms  that 
automatically  and  optimally  price  the  resources  as  well  as  simple,  optimal  bidding  strategies  for  the  agents. 
We  formulate  multi-agent  planning  as  factored  mathematical  programs  that  are  optimized  in  a  distributed 
fashion.  Like  other  market-based  planning  algorithms,  our  methods  allow  for  the  optimization  of  each 
agent?s  local  planning  problem  at  the  agent  with  limited  communication;  however,  they  are  able  to  deal 
with  more  complex  constraints  than  those  usually  used  in  market-based  planning  settings.  We  consider 
three  different  settings  and  give  algorithms  for  each  that  compute  resource  prices  automatically,  and  do  so 
to  guarantee  (near-)optimality.  First,  we  formalize  factored  mixed  integer  linear  programs  (MILPs),  and 
give  a  novel  distributed  optimization  algorithm  by  combining  Dantzig-Wolfe  decomposition  a  distributed 
method  for  optimizing  linear  programs,  with  a  cutting-plane  algorithm.  Second  we  relax  the  framework 
with  Lagrangian  relaxation,  for  more  efficient,  convex  optimization.  Lagrangian  relaxation  yields  an 
approximation  to  the  original  MILP,  but  we  give  a  simple  yet  effective  randomized  rounding  algorithm 
whose  chance  of  failure  can  be  bounded,  and  the  result  is  a  near-optimal  approximation  algorithm  for 
problems  with  a  very  large  number  of  agents  contending  for  the  resources.  Finally,  we  study  planning 
under  uncertainty  in 
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Abstract 


We  propose  a  new  family  of  market-based  distributed  planning  algorithms  for  collabora¬ 
tive  multi-agent  systems  with  complex  shared  constraints.  Such  constraints  tightly  couple  the 
agents  together,  and  appear  in  problems  ranging  from  task  or  resource  allocation  to  collision 
avoidance.  While  it  is  not  immediately  obvious,  a  wide  variety  of  constraints  can  in  fact  be 
reduced  to  generalized  resource  allocation  problems;  the  translation  is  straightforward  for 
task  or  resource  allocation  problems,  and  for  general  problems  any  shared  constraint  between 
agents  can  be  considered  a  resource.  For  example,  satisfying  collision-avoidance  constraints 
is  equivalent  to  assigning  to  a  limited  number  of  agents  the  right  to  claim  a  particular  space  at 
a  particular  time. 

Market-based  algorithms  have  become  popular  in  collaborative  multi-agent  planning, 
particularly  for  task  allocation,  due  to  their  intuitive  and  simple  distributed  paradigm  as  well 
as  their  success  in  domains  such  as  robotics  and  software  agent  systems.  However,  they  suffer 
from  several  main  drawbacks:  1.  it  is  somewhat  of  an  art  to  create  a  reasonable  pricing  in  each 
domain,  requiring  a  human  designer  and  parameter  tuning,  2.  they  rarely  guarantee  optimality, 
3.  they  do  not  often  have  a  natural  way  to  incorporate  uncertainty  in  planning,  and  4.  most 
existing  algorithms  require  a  central  trusted  auctioneer.  This  thesis  presents  a  solution  to 
address  the  drawbacks  by  providing  mechanisms  that  automatically  and  optimally  price  the 
resources  as  well  as  simple,  optimal  bidding  strategies  for  the  agents. 

We  formulate  multi- agent  planning  as  factored  mathematical  programs  that  are  opti¬ 
mized  in  a  distributed  fashion.  Like  other  market-based  planning  algorithms,  our  methods 
allow  for  the  optimization  of  each  agent’s  local  planning  problem  at  the  agent  with  limited 
communication;  however,  they  are  able  to  deal  with  more  complex  constraints  than  those 
usually  used  in  market-based  planning  settings.  We  consider  three  different  settings  and 
give  algorithms  for  each  that  compute  resource  prices  automatically,  and  do  so  to  guarantee 
(near-)optimality.  First,  we  formalize  factored  mixed  integer  linear  programs  (MILPs),  and 
give  a  novel  distributed  optimization  algorithm  by  combining  Dantzig-Wolfe  decomposition, 
a  distributed  method  for  optimizing  linear  programs,  with  a  cutting-plane  algorithm.  Second, 
we  relax  the  framework  with  Lagrangian  relaxation,  for  more  efficient,  convex  optimization. 
Lagrangian  relaxation  yields  an  approximation  to  the  original  MILP,  but  we  give  a  simple 
yet  effective  randomized  rounding  algorithm  whose  chance  of  failure  can  be  bounded,  and 
the  result  is  a  near-optimal  approximation  algorithm  for  problems  with  a  very  large  number 
of  agents  contending  for  the  resources.  Finally,  we  study  planning  under  uncertainty  in  the 
Lagrangian  relaxation  framework  via  stochastic  programming,  and  give  efficient  algorithms 
for  representing  and  optimizing  uncertainty  in  factored  Markov  decision  processes. 

We  evaluate  our  algorithms  on  domains  ranging  from  task  allocation  and  path  planning  to 
supply  chain  management,  and  demonstrate  the  power  of  wielding  complex  shared  constraints 
in  distributed  planning,  which  we  hope  will  continue  to  be  studied  in  a  wide  range  of  domains. 
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Chapter  1 
Introduction 


Multi-agent  planning  is  often  naturally  formulated  as  a  mostly  factored  combinatorial  optimization 
problem:  each  agent  has  its  own  local  state,  constraints,  and  objectives,  while  agents  interact  by 
competing  for  scarce,  shared  resources.  Such  a  problem  is  usually  intractable  as  a  centralized 
optimization  problem,  as  the  joint  state  over  all  agents  is  exponential  in  the  number  of  agents. 
Hence,  given  the  natural  factorization  over  agents,  it  is  beneficial  to  seek  a  distributed  solution, 
where  each  agent  solves  its  individual  local  planning  problem  with  a  fast  single-agent  planning 
algorithm. 

Market-based  planners  provide  a  natural  and  intuitive  framework  to  leverage  such  multi-agent 
structure  to  yield  an  efficient  algorithm  for  distributed  planning  in  collaborative  multi-agent 
systems:  each  agent  bids  for  resources  or  tasks,  using  its  planner  both  to  decide  which  resources 
are  worth  acquiring  and  to  decide  how  to  use  resources  if  acquired.  Then  an  auctioneer  or  a 
distributed  mechanism  is  used  to  assign  said  resources  to  the  agents.  Market-based  planners 
impose  low  communication  costs,  are  simple  to  implement,  and  have  been  shown  to  work  well, 
for  example  in  the  task  allocation  domain  [21,  27,  95], 

However,  market-based  planners  suffer  from  several  well-known  limitations.  First,  to  set  up  a 
good  market  is  something  of  an  art:  a  human  designer  must  choose  carefully,  for  every  new 
problem  domain,  a  set  of  commodities  and  a  market  structure  including  a  strategy  for  pricing 
commodities,  tuning  both  to  balance  planning  effort  against  suboptimality.  Second,  most  market- 
based  planners  cannot  offer  any  guarantees  on  the  final  overall  plan  quality.  Third,  most  algorithms 
degrade  quickly  in  the  presence  of  additional  constraints  that  couple  the  agents’  solutions,  such 
as  task  ordering  or  collision  avoidance  constraints,  adding  to  the  design  effort.  (See  [22]  for 
a  further  discussion.)  Finally,  planning  under  uncertainty  presents  a  challenge  to  most  market- 
based  planning  frameworks,  as  they  are  often  not  built  with  uncertainty  in  mind,  nor  built  upon 
mechanisms  that  easily  allow  incoporation  of  uncertainty. 

This  thesis  addresses  these  problems  by  designing  a  set  of  principled  algorithms  which  automati¬ 
cally  and  optimally  price  resources  in  any  domain.  We  term  such  algorithms  Automatically-pricing 
Market-based  Distributed  Multi-agent  Algorithms  (AMDMA).  We  frame  multi-agent  planning 
as  mathematical  programs,  yielding  combinatorial  and  convex  optimization  problems.  Our  al- 


1 


gorithms  naturally  decompose  a  program  into  subproblems ,  which  correspond  to  agents’  local 
planning  problems  and  can  be  solved  using  fast  domain- specific  single-agent  planning  methods, 
and  a  master  program  that  manages  resource  prices  and  communicates  with  the  subproblems  for 
optimal  resource  allocation. 
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Figure  1.1:  A  broad  view  of  topics  covered  in  this  thesis.  Here  we  characterize  the  space  of 
cooperative  market-based  planning  problems  using  the  two  dimensions  we  focus  on:  the  level  of 
coupling  between  resources,  and  planning  under  uncertainty.  A  more  detailed  comparison  of  our 
work  to  research  in  various  fields  ranging  from  mechanism  design  to  operations  research  can  be 
found  in  Chapter  5. 


We  present  three  variants  of  AMDMA  and  give  distributed  optimization  algorithms  for  each 
setting.  Figure  1.1  gives  a  broad  view  of  the  settings  this  thesis  covers.  In  more  detail: 

•  Chapter  2  introduces  the  formalization  of  our  problem  setting  using  mixed  integer  linear 
programs  (MILP).  We  give  a  novel  distributed  MILP  optimization  algorithm  based  on 
Dantzig-Wolfe  decomposition  and  a  cutting  plane  algorithm,  with  convergence  and  optimal¬ 
ity  guarantees.  Using  distributed  SAT  problems  as  an  abstraction  for  planning  problems,  as 
well  as  an  unmanned  aerial  vehicle  task  allocation  and  path  planning  task,  we  show  that 
both  the  sequential  and  the  distributed  executions  of  the  algorithm  result  in  improvement 
over  CPLEX,  the  industry-standard  MILP  solver. 

•  In  Chapter  3,  we  relax  the  MILP  formulation  to  obtain  Lagrangian  relaxations  which  can  be 
optimized  much  more  efficiently  using  convex  optimization  methods.  We  describe  how  to 
obtain  feasible  solutions  using  a  randomized  rounding  scheme,  and  give  conditions  under 
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which  we  can  guarantee  near-optimal  solutions.  We  present  two  gradient-based  convex 
optimization  algorithms  for  solving  the  Lagrangian  relaxation,  subgradient  descent  and  an 
accelerated  gradient  algorithm.  Finally,  we  apply  the  Lagrangian  relaxation  framework  to 
factored  MDPs  in  the  traffic  management  domain.  We  give  an  augmentation  method  for 
factored  MDPs  that  allows  the  use  of  accelerated  optimization  methods  while  maintaining 
fast  subproblem  computation. 

•  Chapter  4  examines  planning  under  uncertainty  in  our  multi-agent  planning  framework. 
In  particular,  we  extend  the  Lagrangian  relaxation  method  introduced  in  Chapter  3  via 
stochastic  programming,  and  give  an  efficient  construction  of  factored  MDPs  under  uncer¬ 
tainty  based  on  sampling.  We  look  at  the  supply  chain  management  domain  as  an  example 
domain  of  heavy  dependencies  between  agents,  and  demonstrate  how  our  stochastic  plan¬ 
ning  algorithm  can  ameliorate  the  bullwhip  effect,  a  well-known  economic  phenomenon  of 
inefficiency  due  to  suboptimal  handling  of  uncertainty. 

The  rest  of  the  thesis  provides  relevant  related  work  for  each  problem  setting  and  concludes  with 
further  discussion  and  future  work. 
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Chapter  2 

Distributed  Market-Based  Multi- Agent 
Planning 


Market-based  algorithms  have  become  popular  in  collaborative  multi-agent  planning  due  to  their 
simplicity,  distributedness,  low  communication  requirements,  and  proven  success  in  domains 
such  as  task  allocation  and  robotic  exploration.  Most  existing  market-based  algorithms,  however, 
suffer  from  two  main  drawbacks:  resource  prices  must  be  carefully  handcrafted  for  each  problem 
domain,  and  there  is  no  guarantee  on  final  solution  quality. 

Throughout  this  thesis,  we  will  address  these  problems  under  different  settings  using  algorithms 
that  automatically  and  optimally  prices  resources  in  any  domain.  In  this  chapter,  we  formalize 
our  problem  setting,  and  lay  out  the  foundations  for  our  solution  framework  in  the  form  of  a 
general  distributed  solver  that  prices  resources  as  well  as  designs  new  resources,  correcting  pricing 
imbalances  by  adding  “derivative  securities”  based  on  the  original  resources  to  the  market. 

We  formulate  planning  problems  as  mixed  integer  linear  programs  (MILPs);  MILPs  are  a  popular 
representation  for  planning  problems  not  only  in  artificial  intelligence  [85],  but  also  in  operations 
research  [66],  cooperative  control  [2,  69],  and  game  theory  [72],  The  MILP  representation 
lets  us  easily  generalize  the  usual  task  allocation  setting  to  handle  complex  constraints  over 
multiple  agents.  Also,  the  popularity  of  MILPs  makes  our  method  readily  applicable,  since  for 
many  problems  a  MILP  formulation  already  exists;  and  we  expect  an  immediate  impact  as  the 
distributed  nature  of  our  methods  alone  will  enable  larger  problems  to  be  solved  than  previously 
possible.  Finally,  MILPs  are  commonly  used  to  represent  problems  of  planning  under  uncertainty 
(e.g.,  [66,  72]).  Most  market-based  planning  algorithms  do  not  account  for  uncertainty,  e.g.  in 
future  resource  availability  or  future  tasks,  and  therefore  can  return  suboptimal  plans  in  such 
domains;  our  method  therefore  holds  the  promise  of  extending  optimal  market-based  planning  to 
uncertain  domains.  In  Chapter  4,  we  extend  our  framework  to  efficiently  incorporate  handling  of 
uncertainty. 

Our  algorithm  is  based  on  Dantzig-Wolfe  (D-W)  decomposition  [11,  Ch.  6],  a  classical  column 
generation  technique  for  linear  programs  (LPs),  in  which  the  problem  is  reformulated  into  a 
master  program  enforcing  shared  resource  constraints,  along  with  subproblems  for  individual 
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agents.  The  master  program  corresponds  to  the  auctioneer  in  market-based  planners,  choosing  the 
resource  allocation  and  giving  agents  (subproblems)  the  current  prices  of  the  resources.  Based 
on  the  current  prices,  the  agents  iteratively  change  their  demands  towards  a  globally  optimal 
solution. 

As  D-W  decomposition  is  defined  only  for  LPs,  we  extend  the  formulation  to  MILPs  using 
Gomory  cuts  [30],  a  general  cutting-plane  algorithm,  which  allows  us  to  retain  optimality  and 
finite-time  convergence  guarantees  of  the  D-W  decomposition  algorithm  for  LPs.  The  cutting 
plane  algorithm  generates  new  constraints  to  “cut  off”  the  optimal  solution  of  the  LP  relaxation. 
Each  new  constraint  can  be  interpreted  as  a  derivative  resource  representing  the  discretized 
consumption  level  of  a  combination  of  the  original  resources.  Hence  its  price  can  represent  ideas 
like  discounts  or  penalties  on  particular  baskets  of  resources  (see  Sec.  2.4  for  examples). 

Since  subproblems  are  independent  of  one  another,  all  subproblem  computation  can  be  distributed 
for  speed  and  robustness.  While  we  do  not  specifically  address  distributed  computing  issues 
like  CPU  failure  or  communication  complexity  in  this  thesis,  our  algorithm  is  anytime :  feasible 
joint  plans  may  be  generated  before  reaching  the  optimal  solution,  and  if  the  situation  requires, 
can  be  executed.  Also,  since  we  consider  collaborative  domains,  the  auctioneer’s  computation 
may  be  replicated  on  multiple  machines  (or  agents)  for  redundancy.  Finally,  our  algorithm  has 
low  communication  requirements,  since  agents  communicate  only  when  they  might  contend  for 
resources,  and  they  never  need  to  communicate  their  individual  problems  or  complete  plans  to 
a  server  or  to  each  other.  Hence  our  algorithm  is  well-suited  for  situations  where  it  is  difficult 
to  communicate  large  sets  of  constraints  and  objectives  from  individual  problems  to  a  server,  or 
where  centralized  computation  is  not  desirable  for  robustness. 

In  this  chapter,  we  conduct  experiments  on  randomly-generated  factored  integer  programs,  which 
are  both  difficult  to  solve  and  a  versatile  representation  for  planning  problems,  as  well  as  on 
simulated  unmanned  aerial  vehicle  (UAV)  task  allocation  and  path  planning  problems.  The 
experiments  demonstrate  that  (1)  when  applicable,  market-based  planning  can  lead  to  large 
efficiency  gains,  outperforming  an  industrial-strength  MILP  solver  (CPLEX  [39])  for  large 
problems,  even  if  we  force  all  computation  to  be  sequential;  and  (2)  the  benefit  can  become 
even  greater  in  distributed  settings,  where  we  can  take  advantage  of  each  agent’s  computational 
resources  without  a  large  communication  overhead. 


2.1  Problem  Formulation 

We  formulate  multi-agent  planning  problems  as  MILPs  in  standard  form,  factored  over  n 
agents: 

min  E  Ci(xi)  (2.1.1) 

i=l:n 

s.t.  DiXi  =  b  (2.1.2) 

i=l:n 

Xi  £  C),  i  —  (2.1.3) 
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where  x,  represents  the  plan  for  agent  i,  Ci  its  domain,  i.e.,  the  set  of  plans  satisfying  agent  z’s 
individual  constraints,  and  <y(ay),  not  necessarily  convex,  its  cost  function.  Each  Xi  is  a  mixed 
integer  vector  (its  elements  may  be  integers  or  reals);  for  convenience  we  assume  that  Ci  is 
bounded.  (2.1.2)  defines  the  shared  constraints,  where  each  Ai  is  a  matrix  with  the  same  number 
of  rows,  i.e.,  the  number  of  shared  constraints. 

In  the  following  section,  we  introduce  Dantzig- Wolfe  (D-W)  decomposition  [11,  Ch.  6],  which  is 
defined  for  LPs.  We  then  describe  how  D-W  decomposition  can  be  combined  with  a  cutting  plane 
algorithm  to  solve  MILPs. 


2.2  Algorithm 

2.2.1  Dantzig- Wolfe  Decomposition 

Consider  a  problem  of  the  form  (2. 1.1 -2. 1.3)  where  each  Ci  is  convex  and  polyhedral  (relaxed 
from  the  presentation  above).  In  D-W  decomposition,  we  reformulate  the  program  to  consist  of 
a  master  program  and  n  subproblems.  The  master  program  is  defined  in  terms  of  the  n,  basic 
feasible  solutions  x\ . . .  £  Ct  to  each  individual  subproblem  i;  note  n,  may  be  very  large. 

Its  variables  are  wj,  indicators  of  whether  the  individual  solution  x{  is  part  of  the  optimal  joint 
solution: 


min  c(xi)wi 

%—l:n  j=l:rii 

(2.2.1) 

s.t.  y  y  wiDixi  =  b>  wi  e  [o,i] 

i=l:n  j=l:rii 

(2.2.2) 

Y  wj  =  1,  i=l,...,n. 

j=l:rii 

(2.2.3) 

The  number  of  constraints  in  the  master  program  may  be  much  smaller  than  in  the  original 
formulation,  as  the  master  does  not  include  the  individual  constraints.  However,  the  number  of 
variables  may  be  prohibitively  large,  as  it  is  equal  to  the  number  of  possible  individual  basic 
plans  for  all  agents.  We  thus  define  and  solve  the  restricted  master  program,  whose  variables 
include  only  a  subset  of  all  possible  plans,  selected  by  variable  generation.  In  more  detail,  we  can 
find  a  basic  feasible  solution  to  (2.2. 1-2. 2.3)  by  solving  the  restricted  master,  which  is  identical 
to  (2.2. 1-2. 2. 3)  except  that  some  w]  are  forced  to  be  zero.  As  in  the  simplex  method  for  LPs 
(see,  e.g.,  [11,  Ch.  3]),  to  determine  whether  our  current  solution  is  optimal,  we  can  observe  the 
reduced  cost  of  each  non-basic  variable  in  the  solution.  The  reduced  cost  of  a  variable  wj  is 

ct(xj)  -  qTDixj  -  pi,  (2.2.4) 

where  q  contains  the  values  of  dual  variables  for  shared  constraints  (2.2.2)  at  the  current  basic 
solution,  and  /x,  is  the  value  of  the  dual  variable  for  the  sum-to-1  constraint  (2.2.3)  for  agent  i. 
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To  find  the  variable  wj  with  the  least  reduced  cost  (so  we  can  add  it  to  our  basis),  we  can  solve  a 
modified  subproblem  for  agent  i: 

min  Ci(xi)  -  qT DiXi  (2.2.5) 

s.t.  Xi  E  Ci .  (2.2.6) 

Note  that  we  have  altered  the  subproblem  only  in  its  objective  and  only  by  a  linear  term,  so 
domain-specific  algorithms  will  typically  still  be  able  to  solve  the  altered  subproblem.  If  wj  was 
not  already  in  the  restricted  master,  we  can  now  add  it,  guaranteeing  progress  when  we  solve  the 
restricted  master  again. 

For  conciseness,  we  write  the  restricted  master  as: 

min  cTw  s.t.  Dw  =  b ,  (2.2.7) 

where  we  have  renumbered  the  variables  to  be  w\ . . .  Wk,  and  collected  together  the  constraints 
Dw  =  b  and  objective  cTw.  A  subproblem  solution  xj  corresponds  to  a  column  A,xj  in  D  and  an 
entry  ct (xj )  in  c.  For  convenience,  we  have  incorporated  the  sum-to-1  constraints  (2.2.3)  into  D, 
so  b  =  (1, . . . ,  1, 6T)T. 

Recall  that  in  the  market-based  view,  each  shared  constraint  is  considered  a  resource.  The 
dual  values  q  communicated  to  subproblems  then  can  be  interpreted  as  resource  prices,  and 
Dij  as  the  usage  level  of  resource  i  by  plan  j,  as  can  be  seen  from  the  subproblem  objective 
(2.2.5).  If  a  constraint  is  saturated  by  plans  currently  in  the  restricted  master,  the  corresponding 
resource  will  have  a  non-zero  price,  leading  to  new  individual  plans  that  avoid  heavy  usage  of  the 
resource. 

Algorithm  1  shows  an  outline  of  the  Dantzig-Wolfe  decomposition  algorithm.  For  linear  programs, 
the  master  program  and  all  subproblems  are  linear  programs,  and  steps  1  and  2  can  be  solved  by  a 
LP  solver. 

Algorithm  1  Dantzig-Wolfe  decomposition  algorithm 

0.  Solve  subproblems  with  resource  prices  =  0;  use  solutions  to  initialize  the  restricted  master.1 
Repeat: 

1.  Solve  the  restricted  master;  get  dual  values  q,  /j. 

2.  Solve  subproblems  using  new  resource  prices  q. 

3.  For  each  subproblem  i’s  returned  solution  x.t,  if  the  objective  value  satisfies  cl(xl)  — 
qT D^i  <  fj,i,  generate  a  new  column  and  variable  in  the  restricted  master. 

4.  If  no  new  column  has  been  generated  in  this  iteration,  terminate. 


As  presented,  Alg.  1  may  not  terminate  in  a  finite  number  of  iterations  when  degeneracy  is  present; 
however,  anticycling  rules  may  be  employed  to  ensure  finite-time  termination,  as  typically  done 
in  the  simplex  method.  See  [19,  Ch.  23-1]  for  a  discussion  on  anticycling  rules  applicable  to  the 
D-W  decomposition  algorithm. 

'To  handle  infeasibility  in  the  restricted  master,  we  include,  for  each  agent,  a  fictitious  plan  that  has  feasible 
resource  usages  and  a  very  large  cost.  Picking  fictitious  plans  will  lead  to  high  prices  for  over-subscribed  resources, 
guiding  the  subproblems  to  return  plans  that  better  meet  the  resource  constraints. 
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2.2.2  Incorporating  a  Cutting  Plane  Algorithm 


To  represent  and  solve  a  mixed  integer  program  using  D-W  decomposition,  we  must  add  the 
integrality  constraints  w  £  {0,  l}fc  to  the  master  (2.2.2)  and  the  restricted  master  (2.2.7).  (If  C,  is 
convex,  these  constraints  are  unnecessary.  However,  for  MILPs,  a  convex  combination  of  plans 
may  not  be  a  valid  plan,  so  we  need  the  integrality  constraints.)  To  do  so,  we  will  employ  a  cutting 
plane  algorithm.  For  notational  simplicity  we  will  assume  that  each  D,  has  integer  entries,  so  that 
all  resource  usages  are  integral2 

Cutting  plane  algorithms  solve  a  mixed  integer  program  by  adding  cuts,  or  additional  constraints, 
to  its  LP  relaxation,  thereby  “cutting  off”  fractional  solutions  and  eventually  reaching  an  integral 
optimal  solution  in  the  LP  relaxation.  To  use  a  cutting  plane  algorithm  to  solve  the  master  program, 
for  each  set  of  cuts  we  can  use  D-W  decomposition  to  solve  the  LP  relaxation  of  the  master 
program  (i.e.,  without  the  integrality  constraints).  The  process  is  summarized  as  Alg.  2,  and 
here  we  use  the  Gomory  method  for  generating  cuts,  although  other  cutting  plane  recipes  can  be 
applied  as  well.  This  algorithm  is  naturally  distributed:  as  in  the  original  D-W  decomposition 
method,  subproblems  are  solved  independently  by  each  agent,  while  the  restricted  master  program 
is  either  solved  by  a  designated  agent,  or  replicated  on  several  agents  simultaneously  using  a 
deterministic  algorithm.  Then,  cuts  can  be  created  by  a  designated  agent,  or  by  several  agents 
simultaneously  using  a  deterministic  algorithm. 


Algorithm  2  Price-and-cut  market-based  planning 
Repeat: 

1.  Perform  rn  iterations  of  steps  1-4  in  Dantzig- Wolfe  decomposition  algorithm,  or  perform 
the  algorithm  to  termination  (rn  =  oo),  and  return  its  optimal  solution  w  to  the  restricted 
master  LP  relaxation.  Report  whether  w  is  optimal  for  the  full  master  LP  relaxation,  i.e., 
whether  any  new  column  was  generated  in  step  4. 

2.  If  w  is  integral,  and  is  optimal  for  the  full  master  LP  relaxation,  terminate. 

3.  If  w  is  not  integral,  perform  Gomory  cuts  and  add  constraints  to  the  restricted  master 
program,  until  k  cuts  have  been  made  or  no  more  cuts  are  available  (k  =  oo)  for  the  current 
LP  solution  w. 


The  main  algorithm,  price-and-cut,  admits  two  parameters,  which  allows  different  schedules 
over  iterations  of  D-W  and  Gomory  cuts.  Different  schedules  may  lead  to  varying  optimality 
guarantees,  and  in  practice,  to  different  execution  times.  One  particularly  illuminating  schedule  is 
m  =  1  and  k  =  oo;  this  version  of  price-and-cut  is  equivalent  to  applying  D-W  decomposition 
(Alg.  1)  to  the  MILP  and  solving  the  restricted  master  in  step  1  to  its  integer  optimal  solution. 
On  the  other  extreme  is  m  =  oo  and  k  =  1,  in  which  we  solve  the  master  LP  relaxation  exactly 
and  introduce  a  single  cut  at  each  iteration.  This  latter  schedule  guarantees  optimality  in  a  finite 
number  of  iterations,  but  in  practice  may  prove  less  efficient  than  other  schedules,  since  we  must 
find  the  optimal  solution  to  the  full  master  at  each  iteration.  We  give  optimality  results  for  general 
schedules  in  Sec.  2.3. 

2If  not,  we  can  always  divide  the  entries  of  D\, . . . ,  Dn  by  a  common  denominator,  given  that  each  D,  has 
rational  number  entries. 
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Two  issues  arise  in  applying  cutting  plane  algorithms  to  D-W  decomposition.  First,  since 
columns  are  generated  incrementally,  when  we  generate  a  new  cut,  we  do  not  want  to  compute 
all  of  its  elements  immediately — else  we  lose  the  benefit  of  a  small  restricted  master.  Thus 
we  need  an  efficient  way  to  record  our  cuts  and  compute  new  columns  for  the  cut  constraints 
incrementally.  Second,  the  new  constraints  must  be  taken  into  account  in  the  subproblems. 
Intuitively,  new  constraints  become  new  resources;  we  refer  to  these  resources  as  derivative 
resources,  to  differentiate  them  from  the  resources  corresponding  to  the  original  constraints. 
Derivative  resource  usages  will  be  a  function  of  original  resource  usages,  since  cuts  are  generated 
by  performing  operations  on  subsets  of  existing  constraints.  However,  the  functions  will  typically 
be  nonlinear  with  respect  to  the  variables  in  the  subproblem,  unlike  the  original  resources  in 
expression  (2.2.5);  depending  on  the  form  of  the  individual  problems,  it  may  be  easy  or  difficult 
to  plan  to  optimize  combined  (original  and  derivative)  resource  usage. 

As  we  will  see,  it  is  relatively  straightforward  to  solve  both  issues  when  using  Gomory  cuts,  which 
is  why  we  choose  Gomory  cuts  here.  But,  the  Gomory  method  is  only  one  of  many  cutting-plane 
algorithms,  and  we  expect  that  other  rules  may  be  used  in  place  of  Gomory  cuts,  as  long  as  the 
two  issues  above  can  be  resolved. 


2.2.3  The  Gomory  Cutting  Plane  Algorithm 

Suppose  we  have  an  optimal  basic  solution  to  the  LP  relaxation  of  the  restricted  master  program, 
associated  with  the  basis  B,  which  is  composed  of  the  columns  of  A  that  correspond  to  the  basic 
variables  in  the  solution.3  To  make  a  cut  on  the  constraints  (2.2.7),  we  first  choose  a  row  of  B _1, 
say  ( B~l)k ,  such  that  (B~l  )kb,  the  constant  term  in  the  constraint,  is  fractional.  For  example,  one 
may  choose  a  row  randomly  based  on  the  magnitude  of  the  fractional  component  of  ( B~l)kb .  The 
cut  then  has  the  form: 


<  [{B~l)kb\,  (2.2.8) 

3 

where  D,3  denotes  the  j-th  column  of  D.  The  cut  is  added  to  the  constraint  matrix  D,  using  a 
slack  variable  to  maintain  the  standard  form. 

The  cut  is  a  valid  inequality :  any  integral  point  that  satisfies  all  original  constraints  in  (2.2.2) 
satisfies  the  new  inequality.  Also,  the  current  LP  optimal  solution  violates  the  new  constraint, 
ensuring  progress  (for  details,  see  [11,  Ch.  11]).  Furthermore,  when  no  more  cuts  are  available, 
we  have  an  integral  optimal  solution.  These  properties  guarantee  that  the  Gomory  cutting  plane 
algorithm  is  a  finitely  terminating  algorithm  for  solving  MILPs,  including  the  integer  master 
program. 

We  now  discuss  how  to  resolve  the  two  aforementioned  issues  under  Gomory  cuts. 

3  Using  the  simplex  method  to  solve  the  master  LP  relaxation  automatically  gives  us  the  basis  needed  for  making 
Gomory  cuts,  which  is  another  advantage  of  Gomory  cuts. 
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The  Cut  Recipe 


To  solve  the  first  issue  of  efficiently  generating  new  columns  for  cut  constraints,  we  can  simply 
store  a  “recipe”  for  each  cut.  Since  a  Gomory  cut  only  requires  a  sum  over  the  columns,  we  can 
simply  generate  each  coefficient  as  its  column  is  generated.  (Other  coefficients  are  multiplied  by 
zero  and  therefore  ignorable,  since  their  corresponding  variable  is  not  in  the  restricted  master  yet.) 
Let  denote  the  row  of  the  basis  used  to  make  the  cut,  and  i  refer  to  the  row  number  of  the 

cut  in  D.  The  coefficient  for  a  new  column  j  is 

bl3  =  [(B~l)kb{  (2.2.9) 

where  denotes  the  first  i  —  1  rows  of  the  j-th  column  of  D.  For  each  cut,  we  need  to 

store  only  the  row  rk  =  (B~l)k  used  to  make  the  cut,  which  is  a  vector  the  size  of  the  basis  set 
(the  number  of  rows  (i.e.,  constraints)  in  D  at  the  time  of  the  cut).  Note  that,  for  multiple  cuts,  we 
need  to  generate  coefficients  serially,  in  the  order  in  which  the  cuts  were  added  to  D,  since  each 
cut  may  depend  on  previous  ones.  This  fact  causes  two  problems:  we  can’t  parallelize  coefficient 
generation,  and  the  coefficients  Dl3  may  grow  and  cause  numerical  problems  as  we  layer  more 
cuts  on  top  of  each  other.  The  first  problem  is  not  a  big  one  since  coefficient  computation  not  a 
major  part  of  the  overall  computational  cost.  The  second  problem  can  be  difficult  to  resolve,  but 
we  leave  it  as  a  subject  for  future  work. 


Derivative  Resources  in  Subproblems 

For  subproblem  i,  define  yt  to  be  the  usage  vector  of  original  resources  and  Zi  to  be  the  usage 
vector  of  derivative  resources.  Recall  that  the  usage  for  resource  k  for  column  j  is  equal  to  the 
element  Dkj  of  the  master  program’s  constraint  matrix.  Accordingly,  as  we  saw  in  the  subproblem 
objective  (2.2.5),  original  resource  usage  y;  by  a  plan  x%  is  simply  y,  =  Dyty.  We  can  also  write 
zik,  the  usage  of  derivative  resource  k,  in  terms  of  y,  and  previous  elements  of  z%.  Let  k'  denote 
the  row  number  in  D  corresponding  to  the  derivative  resource  k,  and  let  rk  refer  to  the  cut  recipe 
as  defined  in  (2.2.9).  Then,  for  column  j,  rewriting  (2.2.9)  in  terms  of  y%  and  z%  gives: 

zik  =  [rku\,  (2.2.10) 

where  u  =  (ej  yj  ^1.fc_1-))T,  with  et  an  n-dimensional  unit  vector  whose  z-th  element  is  1, 
corresponding  to  the  sum-to-1  constraints  in  the  master  program.  Incorporating  the  new  variables, 
the  subproblem  objective  now  becomes 

min  a(xi)  -  qT(yJ  zJ)T 

and  we  add  the  expressions  above  for  yi  and  zik  as  additional  constraints.  Now,  we  can  encode 
the  non-linear  constraints  (2.2.10)  as  integer  linear  constraints,  which  allows  us  to  use  a  general 
MILP  solver  to  solve  the  subproblems: 

Zik  — 

Zik  >  rku-  (1-  —), 

Zik  ^  ^ 
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where  M  is  the  least  common  multiple  of  the  denominators  of  the  coefficients  in  r/0 . 

Depending  on  the  particular  subproblem  solver  used,  we  may  have  to  handle  derivative  resources 
in  a  domain-specific  way.  In  general,  adding  derivative  resources  to  a  subproblem  can  increase 
the  size  of  its  state  space,  since  the  subproblem  planner  may  now  need  to  track  resource  usage. 
However,  note  that  there  is  a  limit  to  the  possible  increase  in  state  space  size,  since  at  worst  we 
need  to  keep  track  of  our  usages  of  all  of  the  original  resources:  usage  of  any  derivative  resource 
is  a  deterministic  function  of  the  original  resource  usages.  Furthermore,  depending  on  the  domain, 
a  subproblem  solver  may  already  keep  track  of  some  or  all  of  the  original  resource  usages,  in 
which  case  the  state  space  increase  is  limited  still  further. 


2.3  Optimality  and  Termination 

We  now  present  conditions  for  optimality  and  termination  of  price-and-cut. 

Theorem  2.3.1  (Optimality).  For  mixed  integer  programs  of  the  form  (2.1.1)-(2.1.3),  the  solution 
returned  by  price-and-cut  using  optimal  subproblem  solvers  is  an  optimal  solution  to  the  master 
program. 

Proof  At  termination,  the  solution  is  optimal  to  the  full  master  program  LP  relaxation,  and  is 
integral,  which  implies  that  it  is  also  an  optimal  solution  to  the  master  integer  program. 

Theorem  2.3.2  (Termination  for  IP).  For  integer  programs  of  the  form  (2.1.1)-(2.L3)  with 
bounded  variables,  price-and-cut  under  the  schedule  with  m  =  1  and  k  =  oo  will  terminate  within 
a  finite  number  of  iterations  of  both  price-and-cut  and  D-W  if  the  restricted  master  programs  are 
nondegenerate  or  an  anticycling  rule  is  used. 

Proof  sketch.  As  mentioned  in  Section  2.2.2,  price-and-cut  with  the  said  schedule  is  equivalent  to 
the  Dantzig- Wolfe  decomposition  algorithm  where  the  restricted  master  is  solved  to  its  integer 
optimal  solution  every  iteration.  The  integer  optimal  solution  is  guaranteed  to  be  found  in  finite 
time  due  to  the  finite- time  guarantee  for  Gomory  cuts.  Also,  there  can  only  be  a  finite  number 
of  iterations  inside  D-W,  since  Gomory  cuts  do  not  affect  the  number  of  integer  solutions,  and 
thus  only  finitely  many  columns  can  be  added.  Anticycling  prevents  the  subproblem  solvers  from 
returning  the  same  column  infinitely  often. 

Theorem  2.3.3  (Termination  for  MILP).  For  mixed  integer  programs  of  the  form  (2.1.1  )-(2.1.3) 
with  bounded  variables,  price-and-cut  using  optimal  subproblem  solvers  will  terminate  within 
a  finite  number  of  iterations  of  both  price-and-cut  and  D-W,  if  the  employed  schedule  allows 
only  a  finite  number  of  iterations  of  price-and-cut  between  applications  of  Gomory  cuts  to  a 
basic  optimal  solution  of  the  full  master’s  LP  relaxation,  and  the  restricted  master  programs  are 
nondegenerate  or  an  anticycling  rule  is  used. 

Proof.  The  finite-time  termination  guarantee  of  Gomory  cuts  ensures  that  the  number  of  iterations 
of  price-and-cut  spent  on  cutting  basic  LP  optimal  solutions  to  the  full  master  is  finite,  and  there 
are  only  a  finite  number  of  iterations  between  such  iterations.  Each  call  to  D-W  is  guaranteed  to 
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terminate  in  a  finite  number  of  iterations  (with  anticycling)  if  m  =  oo  in  the  schedule,  or  it  will 
terminate  in  m  iterations. 

While  finite-time  guarantees  give  us  little  assurance  for  the  actual  execution  time  of  price-and-cut 
(and  we  would  not  expect  otherwise,  since  general  integer  programming  is  NP-hard),  as  we  will 
see,  our  experiments  suggest  that  only  a  small  number  of  iterations  of  price-and-cut  may  be 
required  in  practice. 


2.4  Illustrating  Derivatives 

Before  we  show  experimental  results  on  larger  sized  problems,  we  present  a  simple  example 
to  provide  intuition  for  the  form  and  interpretation  of  derivative  resources  created  by  Gomory 
cuts. 

Consider  a  small  grid  world  consisting  of  four  positions  and  two  agents,  as  shown  in  Figure  2.1. 
Agent  1  starts  in  position  1,  and  must  to  go  to  position  4;  agent  2  must  do  the  opposite.  At  each 
time  step,  an  agent  can  move  to  a  neighboring  position,  with  constraints  that  agents  cannot  occupy 
the  same  position  simutaneously,  and  also  may  not  swap  positions,  i.e.,  occupy  the  same  edge 
between  two  positions  simultaneously.  It  is  easy  to  see  the  bottleneck  at  position  2:  one  agent 
must  wait  in  position  3  for  the  other  to  pass  through  position  2. 


Figure  2.1:  A  grid  world  with  four  positions 

In  a  typical  run,  our  algorithm  creates  five  cuts  before  termination,  which  are  all  tight  at  termina¬ 
tion;  we  will  examine  the  first  two  cuts  here.  Our  variables  are  binary:  x'-t  =  1  means  that  agent  i 
is  at  position  j  at  time  t.  The  discretizing  (floor)  operations  in  Gomory  cuts  make  it  possible  for 
us  to  write  the  cut  as  a  combination  of  logical  constraints  between  the  position  variables,  which  is 
quite  natural  to  intepret.  For  example,  we  will  see  conjunctions  (variables  connected  by  A),  which 
will  represent  partial  paths,  and  are  examples  of  “baskets”  of  resources.  For  convenience,  we  will 
use  the  convention  that  true  and  false  correspond  to  1  and  0  respectively  and  write  our  cuts  in 
mixed  logic  and  arithmetic  notations.  The  first  cut  we  obtain  is  of  the  form: 

(x22  A  £43)  +  (x22  V  (x42  A  3^23 ))  —  !• 

First  we  see  that,  as  expected,  the  cut  heavily  concerns  position  2,  which  is  the  bottleneck  resource. 
Furthermore,  this  cut  penalizes  agent  1  for  the  partial  path  (2@£2, 4@£3)  of  being  at  position  2  at 
time  2  and  position  4  at  time  3,  and  agent  2  for  the  partial  path  (4@£2,  2@£3),  but  the  penalization 
simply  corresponds  to  the  original  constraint  that  the  agents  should  not  swap  positions  in  a  time 
step.  However,  it  also  penalizes  agent  2  being  in  position  2  at  time  2,  only  if  agent  1  tries  to  be  in 
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position  2  at  time  2  and  in  position  4  at  time  3,  quantifying  the  evident  correlation  between  the 
occupation  of  position  2  at  time  2  and  the  movement  of  the  two  agents  in  the  vicinity. 

Perhaps  the  more  interesting  is  the  second  cut,  which  is  derived  from  the  first  cut  as  well  as  two  of 
the  original  constraints: 


[(xi2  A  x23)  V  ( x22  A  £43)]  +  ( x22  A  £43))  <  1. 

This  new  constraint  penalizes  agent  1  for  taking  the  partial  path  (2@f2, 4@£3).  However,  it  does 
not  penalize  agent  2  for  taking  the  partial  path  (4@£2,  2@f3)  and  thus  does  not  duplicate  any  of 
the  original  constraints;  the  derivative  resource  provides  a  way  to  guide  the  agents  that  had  been 
previously  unavailable.  In  the  final  solution  agent  1  ends  up  yielding  position  2  at  time  2  to  agent 
2  partly  due  to  this  constraint;  the  constraint  is  tight  at  the  final  optimal  solution,  where  we  see  the 
partial  path  (1@£2,  2@f3)  for  agent  1. 


2.5  Experiments 

2.5.1  Timing  Experiments 

We  demonstrate  the  effectiveness  of  price-and-cut  planning  (PC)  and  its  distribted  version  (DPC) 
on  randomly-generated  factored  zero-one  integer  programs.  This  domain  is  both  difficult  (general- 
purpose  solvers  take  103-104  seconds)  and  relevant:  for  example,  the  method  of  propositional 
planning  [44]  encodes  a  planning  problem  as  a  similar  constraint-satisfaction  problem,  with 
feasible  points  corresponding  to  feasible  plans,  and  an  objective  corresponding  to  plan  cost.  Our 
goal  is  two-fold:  (1)  to  show  that  PC  is  an  efficient  solver  for  factored  integer  programs,  and  (2) 
to  investigate  the  effects  of  communication  overhead  required  by  PC  in  distributed  settings. 

To  generate  a  random  instance,  we  pick  a  number  of  variables  and  constraints,  and  for  each 
constraint  we  select  a  sparse  set  of  variables  and  integer  coefficients  at  random.  (Hence  the 
constraint  matrices  A,  and  the  bounds  b  in  each  instance  start  out  integral;  integrality  of  the 
original  Ai  and  Xi  imply  integral  usages  for  the  original  resources,  which  allows  us  to  use  the 
integer  program  subproblem  formulation  from  Section  2.2.3.)  To  make  sure  the  instance  is 
factored,  we  partition  our  variables  into  subsets,  and  generate  a  set  fraction  of  our  constraints 
among  variables  in  the  same  subset;  for  the  remaining  shared  constraints,  we  select  variables  at 
random  from  all  subsets  at  once. 

In  our  experiments,  we  use  random  3SAT  constraints,  together  with  the  objective  “set  as  few 
variables  to  1  as  possible.”  We  picked  SAT  constraints  so  that  we  can  set  the  ratio  of  constraints  to 
variables  near  the  well-known  empirical  hardness  threshold  of  4.26  [29];  however,  the  resulting 
problem  is  not  a  SAT  instance  due  to  the  objective,  and  therefore  SAT-specific  solvers  are  not 
directly  applicable.  Our  implementation  uses  the  CPLEX  MILP  solver  [39],  a  widely-used 
commercial  optimization  package,  for  the  integer  program  subproblems,  and  the  CPLEX  simplex 
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LP  solver  for  the  restricted  master  LP  relaxation.4  We  use  the  schedule  from  Theorem  2.3.2  to 
guarantee  finite  termination. 


(c)  DPC  vs.  CPLEX  (d)  DPC  vs.  PC 


Figure  2.2:  (a)  number  of  PC  iterations  performed,  (b)-(d)  runtime  comparisons 


We  compare  the  runtimes  of  the  centralized  and  distributed  versions  of  PC  to  those  of  CPLEX’s 
MILP  solver,  using  4713  randomly  generated  problems.  We  varied  between  2  to  10  subproblems, 
10  to  200  variables,  and  41  to  900  total  clauses,  of  which  0.1 1%  to  17.65%  were  shared.  The  ratio 
of  the  number  of  clauses  to  the  number  of  variables  was  set  to  between  4.0  and  4.5  to  keep  the 
problems  difficult.  In  the  centralized  version  of  PC,  subproblems  were  solved  sequentially  on  one 
CPU,  alternating  with  the  master  program,  incurring  no  communication  cost.  The  distributed  runs 
were  performed  on  two  8 -core  machines,  where  the  subproblem  solvers  communicated  with  the 
master  over  sockets.  One  process  was  dedicated  to  solving  the  restricted  master  LP  and  making 
cuts. 

4 We  also  tested  the  MiniSat+  solver  [23],  a  specialized  zero-one  IP  solver  based  on  a  SAT  solver,  MiniSat. 
Depending  on  problem  characteristics,  it  was  sometimes  faster  than  CPLEX  and  sometimes  slower.  We  report  results 
using  CPLEX  since  it  is  more  general. 
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Fig.  2.2(a)  shows  the  distribution  of  the  number  of  iterations  of  PC  to  reach  optimality.  Most 
cases  required  only  a  few  iterations,  and  only  34  cases  out  of  4713  required  more  than  10.  In 
Fig.  2.2(b),  each  point  represents  a  problem;  on  the  x  axis  is  the  runtime  of  CPLEX,  and  on  the  y 
axis  is  that  of  centralized  PC,  both  in  log  scale.  The  diagonal  line  is  the  identity:  any  point  below 
the  line  represents  a  problem  instance  where  PC  outperforms  CPLEX.  Our  observations  suggest 
that  CPLEX  running  time  is  heavily  dependent  on  the  number  of  total  clauses  in  the  problem.  We 
can  see  here  that  PC  outperforms  CPLEX  handily  for  larger  problem  sizes,  as  the  advantage  of 
market-based  planning  outweighs  the  overhead  in  our  implementation:  PC  outperformed  CPLEX 
on  92.38%  of  the  instances  where  CPLEX  took  more  than  1  second. 

Fig.  2.2(c)  is  an  analogous  plot  for  the  distributed  version  of  PC  (DPC),  and  exhibits  similar 
trends,  if  not  more  pronounced.  Finally,  Fig.  2.2(d)  gives  a  similar  comparison  between  PC  and 
DPC:  DPC  outperformed  PC  on  96.12%  of  the  instances  where  PC  took  more  than  1  second.  It  is 
interesting  to  note  that,  even  for  problems  which  take  only  Is  to  solve,  the  benefit  of  parallelism 
outweighs  communication  overhead,  despite  our  simple  implementation. 

The  bottom  horizontal  “tier”  of  points  in  Figs.  2.2(b)  and  2.2(c)  contains  almost  exclusively  infea¬ 
sible  instances;  as  expected,  PC  is  very  efficient  at  detecting  infeasibilities  when  a  subproblem  is 
infeasible.  In  Fig.  2.2(c),  additional  tiers  represent  the  effects  of  parallelism:  different  tiers  roughly 
correspond  to  different  percentages  of  shared  constraints  in  the  problem,  with  smaller  percentages 
corresponding  to  lower  tiers,  where  more  execution  time  is  spent  in  subproblems  (which  can  be 
parallelized)  instead  of  the  master  (which  is  not  parallelized  in  our  implementation). 


Figure  2.3:  (a)  an  instance  of  the  UAV  planning  problem  (b)  runtime  comparisons,  PC  vs.  CPLEX 


2.5.2  UAV  Task  Allocation  and  Path  Planning 

We  also  performed  experiments  on  UAV  task  allocation  and  path  planning  problems,  using  the 
MILP  formulation  presented  in  [69].  We  again  compared  the  runtimes  of  PC  and  CPLEX,  in  the 
setting  shown  in  Fig.  2.3(a),  which  was  studied  in  [69].  In  this  particular  configuration,  CPLEX 


16 


took  166s,  which  is  comparable  to  numbers  reported  in  [69],  almost  a  decade  ago,  whereas  PC 
found  the  answer  in  31s.  We  studied  150  instances  based  on  this  map,  with  randomly  generated 
targets.  We  used  as  the  objective  the  sum  of  the  agents’  finish  times;  there  is  little  computational 
advantage  to  using  our  decomposition  under  the  maximum  of  finish  times  objective  used  in  [69], 
as  it  creates  a  star  graph  between  the  agents.  We  observed  trends  similar  to  the  3SAT  timing 
experiments:  PC  outperforms  CPLEX  in  96.9%  of  the  instances  where  CPLEX  took  more  than 
10s.  The  mean  runtimes  were  7.24s  and  30.7s  for  PC  and  CPLEX  respectively,  and  max  times 
26.85s  and  624.99s. 


2.6  Conclusion 

In  this  chapter,  we  laid  out  the  foundations  of  our  framework,  upon  which  we  will  build  in  the 
next  chapters.  We  presented  a  principled  distributed  market-based  algorithm  for  combinatorial 
optimization  and  multi-agent  planning  that  automatically  and  optimally  prices  shared  resources. 
We  provided  optimality  and  convergence  conditions  for  a  general  class  of  MILPs  that  includes 
our  multi-agent  formulation.  Our  experiments  show  general  applicability  of  price-and-cut,  in 
both  centralized  and  distributed  settings.  In  the  following,  we  will  examine  how  to  obtain  a  more 
efficient,  albeit  less  generally  applicable,  algorithm  by  relaxing  the  MILP  framework. 
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Chapter  3 

Scaling  Up  with  Lagrangian  Relaxation 


Mixed  integer  linear  programs  (MILPs)  are  a  versatile  language  for  formulating  and  solving 
various  multi-agent  problems  with  shared  constraints,  and  in  Chapter  2  we  explored  a  distributed 
method  to  solve  them  via  decomposition.  Still  yet,  MILPs  are  often  quite  time-consuming  to 
solve,  and  thus  are  not  ideal  for  domains  that  require  real-time  or  near-real-time  planning.  One 
such  domain  is  that  of  automated  driving,  which  one  can  frame  as  a  distributed  planning  problem, 
where  each  car  plans  its  own  path  while  communicating  with  other  cars  or  a  master  program 
to  satisfy  global  constraints  that  prevent  collisions.  In  this  chapter,  we  will  study  a  scenario 
involving  a  heavily  congested  crossroad,  where  automated  vehicles  plan  and  resolve  conflicts  by 
communicating  via  a  master  program,  with  the  goal  of  reaching  their  destinations  safely. 

A  popular  approach  to  solving  a  difficult  MILP  is  to  employ  linear  program  (LP)  relaxation,  where 
integral  constraints  on  the  variables  are  relaxed  such  that  the  variables  become  continuous.  LP 
relaxation  is  a  special  case  of  Lagrangian  relaxation  [25],  which  relaxes  a  subset  of  the  constraints 
of  a  MILP  to  obtain  what  is  hopefully  an  easier  problem  to  solve.  However,  no  guarantees  exist 
in  general  for  the  quality  of  a  Lagrangian  relaxation  solution.  In  this  chapter,  we  describe  a 
Lagrangian  relaxation  of  the  shared  constraints  that,  in  conjunction  with  randomized  rounding 
and  “cushioning”  the  shared  constraints,  can  be  shown  to  yield  a  near-optimal  solution  to  the 
original  MILP  [31].  In  particular,  the  guarantees  hold  for  problems  with  a  large  number  of  agents, 
where  many  agents  contend  for  each  resource.  This  form  of  Lagrangian  relaxation  also  allows  us 
to  keep  the  subproblems  integral,  and  use  the  same  subproblem  solvers  as  we  would  for  solving 
the  MILP  via  price-and-cut  (Chapter  2).  Our  experiments  demonstrate  that  the  number  of  violated 
constraints  can  be  far  reduced  in  the  Lagrangian  relaxation  solutions  compared  to  non-coordinated 
solutions,  even  in  planning  problems  that  are  small  enough  that  the  theoretical  guarantees  are  not 
very  tight. 

Analogously  to  employing  Dantzig-Wolfe  decomposition  (Chapter  2)  to  a  LP  relaxation  to  obtain 
a  distributed  iterative  algorithm,  we  can  optimize  the  Lagrangian  relaxation  by  using  simple 
gradient-based  optimization  methods  on  the  dual.  As  in  Dantzig-Wolfe  decomposition,  dual 
variable  values  being  optimized  act  as  prices  for  the  resources  so  that  the  subproblems  can  produce 
feasible  plans  accordingly.  Conversely,  computing  the  subgradient  of  the  dual  function  amounts 
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to  planning  in  each  agent’s  sub-problem  independently,  using  the  current  resource  prices.  In 
this  chapter,  we  give  two  gradient-based  optimization  methods  for  Lagrangian  relaxation  with 
different  convergence  rates,  sugradient  descent  and  an  accelerated  gradient  method,  FISTA. 

In  particular,  we  discuss  how  to  augment  factored  MDPs  for  optimization  with  FISTA,  which 
requires  a  partly  “smooth”  objective  function  with  a  Lipschitz  continuous  gradient.  We  consider 
two  penalization  functions  for  factored  MDPs:  the  quadratic  loss  penalty  and  the  maximum 
causal  entropy  penalty  [98].  In  particular,  the  maximum  causal  entropy  penalty  yields  an  efficient 
subproblem  solver,  softmax  value  iteration,  allowing  us  to  maintain  fast  subproblem  planning  for 
individual  agents  and  thus  an  efficient  computation  of  the  gradient.  Unfortunately,  augmentation 
can  lead  to  errors  in  solution  with  respect  to  the  original  objective.  However,  we  show  that  1. 
we  can  bound  the  errors  introduced  by  augmentation,  and  2.  despite  such  errors,  empirically  we 
can  achieve  convergence  to  an  optimal  solution,  at  a  dramatically  faster  rate  than  that  of  exact 
optimization  with  subgradient  method. 

We  begin  by  introducing  our  motivating  domain  of  automated  path  planning  in  traffic. 


3.1  Focus  Domain:  Traffic  Management 
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Figure  3.1:  An  instance  of  the  planning  domain  of  X-shaped  bridges,  formulated  as  a  factored 
MDP  for  the  deliberative  planner,  (a)  shows  the  states,  each  a  yellow  shaded  numbered  rectangle. 
Bold  lines  indicate  boundaries  that  cannot  be  crossed  by  vehicles.  In  (b),  rounded  rectangles 
are  the  resources,  which  are  overlapping  patches  of  the  road  (different  colors  were  used  only  to 
distinguish  between  overlapping  rectangles). 


We  use  X-shaped  bridges  and  interchanges  as  an  example  of  a  difficult  traffic  coordination  problem, 
as  vehicles  are  forced  to  change  lanes  and  contend  for  shared  road  space.  Such  interchanges  are  a 
common  culprit  in  heavy  traffic,  since  agents  must  often  cross  many  lanes  in  a  short  section  of 
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a  road.  Many  examples  of  such  interchanges  can  be  found  in  Pittsburgh,  the  most  infamous  of 
which  is  the  Fort  Pitt  Bridge. 

To  study  this  domain,  we  built  a  simulator  which  computes  the  positions  and  the  movements  of 
the  vehicles  at  a  high  frequency  to  simulate  continuous-time  dynamics  and  interactions.  The 
final  paths  of  the  agents  are  based  on  two  planners:  a  deliberative  planner ,  which  is  used  before 
execution  to  map  out  a  coordinated  plan  and  is  the  focus  of  this  chapter,  and  a  reactive  planner, 
which  is  run  at  execution  time  on  each  agent  to  ensure  collision  avoidance  while  following  the 
more  coarse-grained  plans  output  by  the  deliberative  planner.  The  reactive  planner  also  helps 
agents  replan  under  unexpected  events  such  as  lane  closures  or  accidents  in  this  domain  (for 
some  of  which  we  can  plan  during  the  deliberative  stage  using  an  algorithm  such  as  that  given  in 
Chapter  4). 

For  the  deliberative  planner,  we  formulate  each  agent’s  path  planning  problem  as  a  deterministic 
Markov  decision  process  (MDP).  The  global  problem  then  becomes  a  factored  MDP  where  agents 
interact  with  one  another  through  constraints  on  shared  resources.  In  Figure  3.1(a)  we  depict 
an  example  problem  instance,  where  states  (physical  locations  of  vehicles)  are  represented  by 
yellow  rectangles  labeled  with  state  IDs.  Each  agent  starts  at  one  of  the  first  five  states  (1-5, 
i.e.,  entering  the  bridge)  and  must  reach  its  goal  state  (randomly  picked  as  one  of  the  last  states, 
here  49-52,  i.e.,  exiting  the  bridge).  In  each  state,  agents  are  allowed  to  move  in  four  directions, 
corresponding  to  staying  in  the  current  state  or  moving  in  each  forward  direction  (moving  one  state 
to  the  left-forward,  straight,  or  right-forward),  but  cannot  cross  the  solid  lines.  The  constraints  in 
this  domain  encode  collision  avoidance  by  defining  resources  as  patches  of  the  road  as  shown  in 
Figure  3.1(b)  and  only  allowing  a  limited  number  of  agents  to  occupy  a  patch  at  a  time.  In  the 
following  sections  ,  we  describe  the  Lagrangian  relaxation  algorithm  for  solving  general  MILPs, 
as  well  as  its  specialization  to  deterministic  factored  MDPs. 

The  reactive  planners  (and  thus  the  simulations)  are  run  at  a  high  frequency  with  much  smaller 
time  steps  than  that  used  by  the  MDP,  and  use  a  potential-field  method  for  avoiding  collisions: 
we  apply  a  repulsive  force  to  the  control  targets  of  any  pairs  of  cars  which  approach  too  closely. 
In  particular,  the  reactive  planners  are  proportional-derivative  (PD)  controllers  based  on  the 
differentially-flat  representation  of  a  differential-drive  robot  (a  “unicycle”).  We  note  that  while 
this  representation  is  simpler  than  an  Ackermann-steered  car,  it  is  adequate  for  our  purposes,  since 
the  cars  are  always  moving  forward  with  lower-bounded  velocity. 


3.2  Lagrangian  Relaxation 


First  we  define  our  mathematical  program,  which  is  a  specialization  of  our  mixed  integer  pro¬ 
gramming  (MILP)  framework.  We  focus  on  linear  objective  functions  for  simplicity  of  discussion; 
however,  the  objective  may  in  fact  be  composed  of  arbitrary  convex  functions  with  respect  to 
each  agent’s  local  variables.  We  allow  arbitrary  subproblem  constraints,  only  requiring  the  sub¬ 
problem  solutions  to  be  bounded  (so  that  the  subgradient  is  bounded).  We  assume  that  the  shared 
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constraints  are  linear1 . 

Following  a  similar  notation  to  that  for  MILP  in  Equations  (2.1. 1)— (2. 1.3),  we  can  write  our 
program  as: 


max  E  cjxi  (3.2.1) 

i=l:n 

s.t.  DiXi  <  b  (3.2.2) 

i=l:n 

Xi  G  Ci,  i  =  1, . . . ,  n,  (3.2.3) 

where  (3.2.2)  corresponds  to  shared  constraints,  and  (3.2.3)  to  an  individual  agent’s  constraints. 
In  comparison  to  (2.1.2),  we  use  inequality  constraints  for  convenience,  though  without  loss  of 
generality. 

We  now  introduce  an  approximation  that  will  enable  us  to  employ  more  computationally  efficient 
forms  of  the  shared  constraints. 


3.2.1  Approximating  Shared  Constraints 

The  size  of  the  domain  (i.e.,  the  bounds  on  the  variables)  in  an  optimization  problem  inherently 
affects  the  runtime  of  its  solver.  For  example,  consider  a  convex  function  /  with  variables 
x  G  X  CK",  and  define  the  radius  of  the  domain  as  R  =  supxeX  1 1 x  —  ,x*||  with  respect  to  a 
optimal  solution  x*.  In  the  worst  case,  a  first-order  optimization  method  such  as  subgradient 
method  incurs  error  e  =  f(x^)  —  fix*)  at  e  =  o(R2/k2)  at  any  iteration  k  <  n  (Theorem 
3.2.1,  [58]);  intuitively,  the  algorithm  must  reach  the  optimal  point  from  the  starting  point,  which 
if  we  are  unlucky,  could  be  as  far  as  traversing  the  entire  space. 

As  we  will  see  in  the  following  section,  our  optimization  variables  are  the  Lagrange  multipli¬ 
ers ,  and  using  hard  constraints  results  an  unbounded  domain,  which  can  break  the  standard 
convergence  proofs  that  depend  on  the  radius  R  above  for  finite-time  convergence,  for  algo¬ 
rithms  like  subgradient  method.  Hence,  here  we  limit  the  domain  of  the  Lagrange  multipliers  by 
approximating  the  hard  constraints  with  piece-wise  linear  functions  of  the  form 

f(y)  =  ma x{ay  +  0\  (a,0)  e  F}, 

where  F  is  a  finite  set  of  slope-intercept  pairs  (a,  f)  (following  Gordon  et  al.  [31]).  Note  that  the 
approximation  is  in  fact  quite  flexible:  with  a  large  enough  slope  a  piece- wise  linear  function 
can  closely  approximate  hard  constraints2,  and  with  a  large  enough  number  of  segments,  it  can 
approximate  smooth  functions  to  any  approximation  accuracy  required. 

1  Or  convex,  since  we  can  approximate  any  convex  constraint  with  several  linear  constraints. 

2  In  fact,  for  any  given  problem  there  exists  a  sufficiently  large  slope  for  the  hinge  loss  (defined  shortly)  that  there 
is  no  difference  between  hinge  and  hard  constraint. 
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In  more  detail,  for  each  resource  j,  we  can  replace  the  shared  constraint  Yhi= i-n  Aijxi  <  bj  by 
adding  to  the  objective  a  function  fj  of  resource  usages,  rendering  the  overall  program  to  be: 

max  E  cJxi  ~  E  /i(E  ~ (3.2.4) 

i=l:n  j=l:m  i=l:n 

s.t.  Xi  G  Ci,  i  =  1, . . . ,  n.  (3.2.5) 

A  useful  instance  of  fj  is  the  hinge  loss 

ff(y)  =  max(0  ,ctjy), 

which  softens  the  hard  constraints  but  still  penalizes  over-usage  of  resources,  with  a3  controlling 
the  strength  of  the  soft  constraint.  Hinge  loss  is  convex  and  continuous,  making  it  suitable  for 
efficient  optimization,  and  can  approximate  hard  constraints  with  a  sufficiently  large  a3.  It  is  also 
flexible;  for  example,  the  “hinge  point”  can  be  moved  to  “cushion”  the  resource  usage,  to  further 
encourage  the  resource  constraints  to  be  met.  For  these  advantages,  hinge  loss  will  be  our  fj  of 
choice  for  the  traffic  management  domain. 


3.2.2  Lagrangian  Relaxation 

We  are  now  ready  to  define  the  Lagrangian  relaxation  for  our  problem.  We  obtain  the  Lagrangian 
relaxation  of  (3.2.4)-(3.2.5)  by  dualizing  the  shared  constraints  in  (3.2.4).  The  problem  then 
becomes  finding  the  optimal  dual  value 

V*  =  inf  V  (A)  (3.2.6) 

A 

=  inf  max  ^  cfxt  +  ^  [ff(Xj)  “  A/(  E  DijXi  ~  ^  S-t  Xi  G  ^Q-2-7) 

i=l:n  j=l:m  i=l:n 

where  A,  are  the  dual  variables,  and  f  * (X,)  =  sup  jA7vr  —  fj(z)]  is  the  Lagrangian  dual  of  fj.  V*, 
the  optimal  dual  value,  provides  an  upper  bound  on  the  value  of  (3.2.4)-(3.2.5). 

Rearranging  (3.2.7),  we  get 

1/*  =  inf[  [f*(Xj)  +  A jbj]  +  max[  £  {cj  -  E  XjDn)xi\]  s.t.  Vi,  Cit  (3.2.8) 

j=l:m  i=l:n  j=l:m 

which  reveals  two  terms:  the  first  sum  discourages  the  dual  values  A  from  growing  too  large,  and 
the  latter  optimizes  the  sum  of  agents’  utilities,  with  penalty  for  each  one’s  resource  usage  priced 
by  A. 


3.2.3  Optimization 

We  can  solve  Equation  (3.2.7)  by  iteratively  updating  A  using  a  first-order  method  such  as 
subgradient  projection.  In  Section  3.3,  we  discuss  different  gradient-based  methods  and  their 
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applicability  to  different  types  of  objective  functions,  as  well  as  how  to  obtain  a  feasible  solution 
to  the  original  MILP  given  a  near-optimal  solution  to  the  Lagrangian  relaxation.  Here  we  discuss 
how  to  compute  the  subgradient  (needed  for  all  first-order  methods). 

It  is  straight-forward  to  compute  the  subgradient  of  V (A),  the  dual  value  at  a  specific  value  of  A. 
First  we  rewrite  L(A)  using  the  definition  of  /*,  the  Lagrangian  dual  of  fy. 

V (A)  =  [max[Ayz  —  fj(z)\  +  A, -6,-]  +  maxfY^  (cj  —  A yZAy)ay]  s.t.  Vi,  C\. 

j=l:m  i=l:n  j=l:m 

Now,  let  z*  =  arg maxz [XjZ  -  fj(z)\  and  x*  =  argma xXiS,tCi[(cI  ~  J2j=i:mXjDij)xi\-  Then 
the  subgradient  of  V (A)  with  respect  to  Ay  is 

d[V(\j)}  =  z-+bj-YJDiix’. 

i 

In  the  case  of  the  hinge  loss  fy1  (z),  for  0  <  Ay  <  ay,  we  can  set  z*  =  0  to  obtain  a  subgradi¬ 
ent 

i 

which  corresponds  exactly  to  the  excess  usage  of  resource  j,  giving  rise  once  again  to  the  price 
interpretation  of  the  dual  variables. 

The  form  of  the  subgradient  shows  that  the  computation  of  the  subgradient  is  easily  distributed,  as 
the  subproblems 

max[(cj  —  XjDij)xi]  s.t.  CL 

Xi 

j=l:m 

are  independent  given  A  and  thus  ay  can  be  found  in  parallel.  Hence  Lagrangian  relaxation  leads 
to  a  naturally  distributed  algorithm  where  each  agent’s  planning  may  be  performed  in  parallel  and 
communication  is  achieved  solely  through  A  and  the  resource  usages. 

Note  that  since  fj  is  piece- wise  linear,  the  dual  function  f*  imposes  constraints  on  Ay:  a“im  < 
A  j  <  a“ax,  where  a™  and  aj,ax  are  the  smallest  and  largest  slopes  of  the  linear  segments 
respectively.  At  this  point  we  have  enough  information  to  set  up  the  simplest  Lagrangian  relaxation 
algorithm:  we  simply  use  projected  subgradient  descent  to  optimize  A  over  the  box  constraints 
o:mm  <  A  <  amax.  The  projection  step  is 

A  j  max(a“m,  min(a™ax,  Ay)), 

and  the  subgradient  is  given  in  (3.2.9). 

Note,  however,  that  the  projected  subgradient  method  gives  us  directly  only  a  value  for  the  dual 
price  vector  A.  We  discuss  below  how  to  go  back  to  a  primal  solution;  the  main  issue  is  that, 
due  to  the  duality  gap,  standard  methods  for  obtaining  a  primal  solution  are  only  guaranteed  to 
satisfy  the  original  hard  shared  constraints  in  expectation.  Hence,  additional  steps  are  required  to 
guarantee  a  feasible  solution  to  the  original  MILP;  we  describe  a  simple  randomized  rounding 
scheme  in  Section  3.3.3  for  obtaining  a  near-optimal  solution  to  the  MILP. 
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3.2.4  Lagrangian  Relaxation  for  Factored  MDPs 


We  now  describe  how  to  apply  Lagrangian  relaxation  to  deterministic  factored  MDPs,  which  we 
use  to  formulate  the  traffic  management  domain. 

A  general  multi-agent  MDP  can  be  written  as  an  MDP  whose  size  is  exponential  in  the  number 
of  agents:  its  state  and  action  spaces  are  the  cross-product  of  all  agents’  state  and  action  spaces. 
However,  as  we  have  observed,  a  multi-agent  planning  problem  is  mostly  factored,  where  each 
agent’s  local  problem  can  be  isolated  from  the  global  problem.  Hence  we  can  define  a  factored 
MDP  in  which  the  state  and  action  probabilities  of  each  agent  only  depends  on  other  agents 
through  a  set  of  shared  constraints. 

Let  random  variables  Alt  and  Slt  represent  the  action  and  state  of  agent  i  at  time  t  respectively. 
For  agent  i,  given  the  distribution  over  the  initial  states  P(,S',i ),  the  transition  probabilities 
P(Sijt+i\Ait,  Sit),  and  the  rewards  rAiuSit,  the  goal  of  MDP  planning  is  to  find  a  policy,  P(Alt\Slt), 
that  optimizes: 


T 


max  V  V  rAiusitP(Ait,Sit) 

P(Ait\Sit),P(Ait,Sit)  f-'  /-£ 
t— 1  An, Da 

(3.2.10) 

1— 1 

Sn ) 

=  P(An\Sii)P(Sii),  yi,An,Sn 

(3.2.11) 

P{Ait- 

,sit) 

=  P(Alt\Slt)J2  PiSitlAt-iiSi^PiAi^lSi^), 

Si,t  —  1  Ai't-  1 

VM  >  1,  Ait,  Sit 

(3.2.12) 

Y,  P(Ait,Sit)  =  1,  Vz,f 

An  i^it 

(3.2.13) 

YP(At\Sit)  =  l,  Vi,  t,  Sit 

A  -x 

(3.2.14) 

Axt 

P(Ait\Sit)  G  {0,1},  Vi,t,  An,  Sn. 

(3.2.15) 

For  this  chapter  we  will  focus  on  deterministic  MDPs,  where  the  initial  state  distribution  and  the 
transition  probabilities  are  assumed  to  be  deterministic,  i.e., 

P(Sn)  G  {0,1},  P(Sit\Aitt~i,  £  {0,1}, 

and  the  agents’  (deterministic)  policies  completely  determine  the  joint  state  at  all  times.  Since  a 
deterministic  optimal  policy  always  exists  for  a  MDP,  constraints  (3.2.15)  ensure  that  such  a  policy 
is  picked  over  one  that  may  have  non-integral  policy  probabilities  due  to  ties  in  the  value  function. 
In  turn,  the  joint  state-action  probabilities  P(Ait ,  Slt),  which  are  directly  computable  from  the 
policy,  transition  probabilities,  and  the  initial  state  distribution,  are  also  deterministic.  Hence, 
(3.2.1 1)— (3.2.15)  define  a  MILP,  yielding  shared  constraints  on  indivisible  resources: 
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Note  that  the  program  defined  by  (3.2.10)-(3.2.15)  is  not  a  MILP,  since  constraints  (3.2.12)  are 
not  linear.  However,  since  P(Sit\Ai^1,  Sijt- 1)  and  P(Sn)  are  binary,  we  can  obtain  an  equivalent 
MILP  by  writing  each  of  the  constraints  as  an  equivalent  set  of  linear  constraints.  Here  we  leave 
them  in  the  more  intuitive  form  for  readability.  For  optimization  purposes,  the  exact  expression 
we  use  here  for  the  constraints  has  no  effect,  since  our  Lagrangian  relaxation  algorithm  allows  us 
to  solve  each  agent’s  MDP  using  value  iteration. 

Using  hinge  loss  fy'  as  shared  constraint  penalty,  the  global  multi-agent  optimization  problem 
becomes 


max  )  \ 

2=1 


:P(A,  Si)  -  E  /f(E  DVP(^  Si)  -  bi)  s.t.  Vi,  C,  (3.2.16) 


j= i 


where  Ct  denotes  the  constraint  set  defined  by  equations  (3.2. 1 1)— (3.2. 14)  for  each  agent 
i,  and  we  have  vectorized  rewards  r,:  and  the  state-action  probabilities  P{Al.  St)  = 
[P(An,  Sn) . . .  P(Aix,  Sir)}-  As  usual,  we  denote  matrices  D\,...,Dn  such  that  Di  de¬ 
fines  the  linear  resource  usage  of  agent  for  factored  MDPs,  resource  usage  is  a  function  of  the 
state-action  probabilities,  DijP(Ai,  Si),  for  agent  i  and  resource  j.  For  example,  for  the  traffic 
management  problem,  the  resource  usage  for  resource  j,  a  patch  of  road,  is  of  the  form 


E  E  -^(Positi°n  °f  agent  i  indicated  by  Sit  is  within  road  patch  j)P(Sit) 

i  t 


where  /  is  an  indicator  function,  as  we  aim  to  limit  the  number  (or  more  precisely,  the  sum  of 
the  probabilities)  of  agents  in  a  particular  region  of  road  at  the  same  time.  (Note  that  P(Su)  = 
J2au  p(Ait,  Sit),  hence  the  resource  usages  are  linear  in  P(Ait.  Slt).) 

Now,  letting  vectors  Xi  =  P(Ai ,  St)  for  notational  simplicity,  the  Lagrangian  relaxation  of  (3.2.16) 
gives  us  the  optimization  problem 


n 

inf  V(X)  =  inf  max  rjxi 

A  v  '  A  a;  ^  1 
2=1 


+  -  -M  E  D<ix<  -  M  s-‘-  MCV 

j— 1  i=l:n 


(3.2.17) 


Expanding  the  dual  function  of  hinge  loss 

X 


we  obtain 


n  m 

V  (A)  =  max  E  rjxi  +  Eato  E  Aj  E  Oij)  +  A j(bj  ^  ^  Djj Xi ) ]  s.t.  Vi,  Cj , 

i=l  j  1  i 


where  again  ay  is  the  slope  of  hinge  loss  fP,  and  the  indicator  function  I(Z)  equals  oo  if  condition 
Z  is  not  satisfied,  and  equals  0  otherwise. 


Note  that  V (A)  is  subdifferentiable  for  A  s.t.  X:I  e  [0,  oy],  Vy,  with  subgradient  shown  in  Equation 
(3.2.9).  To  compute  the  gradient,  x*  can  be  calculated  by  each  agent  %  in  a  distributed  fashion 
given  the  current  A.  In  particular,  for  factored  MDPs,  we  can  use  value  iteration  to  efficiently 
solve  each  subproblem  to  find  x*. 
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3.3  Optimization 


Here  we  discuss  two  gradient-based  methods,  the  subgradient  method  and  an  accelerated  gradient 
descent  method,  FISTA,  for  finding  a  solution  to  the  Lagrangian  relaxation,  and  how  they  can 
be  applied  to  the  Lagrangian  relaxation  of  factored  MDPs.  We  then  show  how  to  obtain  a  near- 
optimal  feasible  solution  to  the  original  MILP  once  we  have  found  a  solution  to  the  Lagrangian 
relaxation. 


3.3.1  Subgradient  Projection 

Subgradient  projection  is  a  general  and  robust  gradient-based  algorithm,  requiring  only  a  subgra¬ 
dient  at  each  point  in  the  domain.  Applying  subgradient  projection  to  the  Lagrangian  relaxation 
problem  (3.2.7)  using  the  subradient  and  the  projection  operator  described  in  Section  3.2.3  leads 
to  the  subgradient  Lagrangian  relaxation  (SLR)  algorithm  introduced  by  Gordon  et  al.  [31],  which 
we  show  here  in  Algorithm  3. 


Algorithm  3  The  subgradient  Lagrangian  relaxation  algorithm 
Inputs:  r],  the  step  size  constant.  T,  number  of  iterations. 
Outputs:  Xi,  average  policies.  A j,  average  dual  values. 

Ajo  4—  0 

for  t<r-  1,2,  ...,T 

•  Vi,  x*t  arg  maxa,i  s.t.  Ci  (q  -  A**  A)T^i 

•  Vj,  z*  +-  arg  sup  JA jtz  -  fj(z )] 

•  Vj,  Xjt  <-  Xj,t-i  ~  +  bj  -  J2i  DijX*t) 

•  Vj,  Xjt  max(a™n,  min(ct™ax,  A jt)) 

=  |ELi  x*k 

^3i  —  t  1  Aa 


Algorithm  3  encodes  one  of  the  standard  methods  for  recovering  a  primal  solution  from  the  dual 
subgradient  method:  one  of  the  outputs  of  the  algorithm  is  the  average  policy  x^  which  can  be 
used  to  find  a  feasible  solution  to  the  original  MILP  by  rounding  as  described  in  Section  3.3.3.  The 
solution  x*t  at  some  iteration  t  is  not  guaranteed  to  be  a  convergent  solution,  since  the  objective 
function  may  not  necessarily  be  smooth  and  thus  the  solution  may  oscillate  indefinitely3. 

Nevertheless,  the  average  solution  x^  converges  the  set  of  optimal  solutions  and  the  cost  of 
Xi  converges  to  the  optimal  cost.  SLR,  as  per  standard  results  about  subgradient  method,  has 

3  For  example,  consider  a  hinge  loss  function  for  a  single  resource  with  a  large  number  of  agents.  Let  every  agent 
be  identical,  and  let  each  agent’s  reward  be  —1  if  it  did  not  use  the  resource,  and  1  if  it  did.  Then  there  are  two 
policies  for  each  agent:  “use  the  resource”  and  “do  not  use  the  resource”.  Now,  since  all  agents  are  equivalent,  at 
each  iteration  every  agent  will  compute  the  same  policy  given  the  resource  price,  and  the  resource  price,  i.e.,  the  dual 
variable,  will  oscillate:  if  all  agents  choose  not  to  use  the  resource,  the  resource  costs  will  be  reduced,  hence  in  the 
next  iteration  all  agents  will  choose  to  use  it,  which  raises  the  resource  price,  and  the  cycle  repeats. 
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a  convergence  rate  of  O(^)  on  general  convex  objective  functions,  hence  to  acquire  an  e- 
approximate  solution  requires  0(j>)  iterations. 


3.3.2  Accelerated  Gradient  Methods 

While  subgradient  method  is  a  very  robust  optimization  algorithm,  it  suffers  from  a  slow  conver¬ 
gence  rate.  Quasi-Newton  methods  such  as  L-BFGS  are  a  popular  alternative  to  the  subgradient 
method,  and  have  seen  much  success  in  practice  [54].  However,  they  assume  a  smooth  (twice- 
differentiable)  objective  function  in  general,  and  hence  are  not,  at  least  theoretically,  suitable  for 
most  versions  of  the  constraint  penalty  function  /4.  Here  we  focus  on  an  accelerated  gradient 
method,  FISTA  [7],  which  can  handle  objective  functions  with  non-smooth  parts.  We  describe 
how  FISTA  can  be  used  with  Lagrangian  relaxation.  Then  in  Section  3.4,  we  give  an  augmented 
objective  for  the  factored  MDP  domain  for  optimizing  with  FISTA  and  demonstrate  the  benefits 
of  the  accelerated  gradient  method  on  efficient  optimization  of  the  Lagrangian  relaxation  on  the 
traffic  management  problem. 

FISTA  is  guaranteed  to  converge  to  an  e-approximate  solution  in  0(1/  y/e)  iterations  given  an 
objective  function  with  a  smooth  component5.  More  precisely,  FISTA  works  on  an  objective 
function  F  =  f  +  g  comprised  of  two  functions  /  and  g,  where  /  :  R"  -A  E  is  a  continuous 
convex  but  potentially  non-smooth  function,  and  g  :  Rn  -A  E  is  a  smooth  function  with  a  Lipschitz 
continuous  gradient,  i.e.,  for  all  x,  y  €  Mn, 

II Vg(x)  -Vg(y) ||  <  L||x-y||, 

where  L  >  0  is  called  a  Lipschitz  constant  for  function  V g.  Intuitively,  the  existence  of  a  Lipschitz 
constant  for  the  gradient  of  a  function  implies  that  the  gradient  cannot  change  too  quickly,  and 
thus  the  function  is  smooth,  which  is  generally  nicer  to  optimize. 

If  our  objective  functions  q  (:c)  are  strongly  convex,  then  the  Lagrangian  dual  will  have  a  Lipschitz 
continuous  gradient  with  respect  to  A,  and  we  can  directly  apply  FISTA.  However,  if  ct(x)  are  not 
strongly  convex,  such  as  in  the  case  of  linear  objective  functions,  we  can  augment  the  objective 
with  a  strongly  convex  term  such  as  a  quadratic  loss: 

9(x)  = 

i 

whose  strength  can  be  controlled  by  a  parameter  which  will  also  affect  the  Lipschitz  constant  L. 
If  the  original  problem  is  linear,  adding  a  quadratic  term  turns  the  subproblems  into  quadratic 
programs,  which  can  be  solved  using  a  variety  of  existing  quadratic  program  solvers. 

The  solution  to  this  augmented  program  may  not  correspond  to  the  optimum  to  the  original 
program.  However,  it  is  possible  to  bound  the  error  based  on  the  penalty  term  and  the  subproblem 

4However,  L-BFGS  can  be  fast  and  robust  in  practice  when  used  with  augmentation,  as  we  show  in  Chapter  4. 

5  Subgradient  method  achieves  0(1 /e)  convergence  for  the  same  class  of  functions,  hence  we  can  use  the 
augmentation  methods  in  this  section  to  accelerate  convergence  using  subgradient  method,  albeit  to  a  slower  rate  than 
using  FISTA. 
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solver,  as  we  show  in  Section  3.4  for  factored  MDPs.  We  can  then  account  for  the  smoothing 
error  as  part  of  the  approximation  error  for  Lagrangian  relaxation  in  achieving  an  e-approximation 
solution.  That  is,  we  trade  the  error  introduced  by  the  penalty  term  against  the  error  introduced 
by  incomplete  convergence  of  the  optimizer  given  a  fixed  iteration  budget.  In  Section  3.4.2  we 
demonstrate  experimentally  that  the  solution  to  the  augmented  problem  gives  us  near-optimal 
solutions,  while  achieving  dramatically  faster  convergence  than  subgradient  projection. 

We  note  that  unlike  the  subgradient  method,  FISTA  will  give  convergent  primal  solutions.  How¬ 
ever,  since  due  to  Lagrangian  relaxation  the  primal  solutions  obtained  will  only  satisfy  the  shared 
constraints  in  expectation,  and  furthermore  here  they  are  solutions  to  the  augmented  problem,  we 
must  still  find  a  feasible  solution  to  the  original  MILP.  We  describe  one  method  to  do  so  in  the 
following  section. 


3.3.3  Randomized  Rounding 

Given  an  e-approximate  (feasible)  solution  to  the  Langrangian  relaxation  dual  obtained  via 
methods  above,  we  must  find  a  feasible  primal  solution,  i.e.,  plans  for  agents.  Here  we  give  a 
simple  randomized  rounding  scheme  that  gives  near-optimal  primal  solutions  when  a  large  number 
of  agents  contend  for  each  resource. 

Given  x,  a  distribution  over  deterministic  plans,  which  may  be  an  average  solution  from  the 
subgradient  method  or  a  solution  from  FISTA  on  the  augmented  program,  each  agent  can  indepen¬ 
dently  sample  a  deterministic  individual  plan  (where  only  one  action  is  taken  given  a  state).  The 
sampled  set  of  plans  may  violate  the  shared  constraints,  which  we  resolve  via  three  methods. 

First,  for  problems  with  a  large  number  of  agents  and  restrictions  on  each  agent’s  contribution 
to  overall  resource  usage,  we  can  increase  the  likelihood  of  the  sampled  solution  being  globally 
feasible  by  “padding”  the  constraints.  We  can  define  a  small  “cushion”  e3  >  0  on  each  resource 
usage  penalty,  which  modifies  the  resource  penalty  function  fj  to 

fj(ej  "t”  —  )  •  (3.3.1) 

i 

The  intuition  for  the  cushions  will  be  that  since  we  have  effectively  decreased  the  allowed  resource 
usage,  solutions  with  a  low  penalty  value  should  be  more  likely  to  yield  solutions  with  acceptable 
levels  of  resource  usage. 

Also,  following  Gordon  et  al.  [31],  we  can  define  limited  influence  of  an  agent  on  a  resource  in 
two  parts.  First,  we  assume  that  every  agent  has  a  similar  amount  of  influence  on  the  overall  usage 
of  each  resource  as  well  as  on  the  value  of  the  optimal  plan,  i.e.,  that  there  exists  a  constant  U  >  0 
such  that 


--\V*\<cJxi<-\V*\  (3.3.2) 

n  n 

--KI  <  DjjXi  <-\u* I  (3.3.3) 

n  j  j  n  J 
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for  all  i,  j,  and  27  G  C,  ,  where  u*  =  J2,  DijX*  is  the  usage  of  resource  j  in  some  optimal  solution. 
Second,  we  assume  that  a  small  change  in  the  overall  resource  usage  leads  to  only  a  small  change 
in  solution  quality,  i.e.,  that  there  are  no  sudden  jumps  in  the  solution  with  respect  to  the  resource 
usage.  In  particular,  let  V*  be  the  optimal  dual  value  obtained  using  f:)  of  the  form  (3.3.1).  Then 
we  assume  that  there  exists  a  bound  emax  >  0,  condition  number  k  >  0,  and  a  discretization  level 
A  >  0  such  that 

V;  >V'  -  -  A/n, 

j 

so  long  as  0  <  e;  <  for  all  j.  While  the  second  restriction  limits  us  from  having  hard 
constraints,  it  allows  for  the  use  of  piece-wise  linear  penalty  functions. 

Gordon  et  al.  (Theorem  1,  [31])  show  that  if  we  make  the  assumptions  above,  then  the  result  x  of 
our  randomized  rounding  algorithm,  sampled  from  a  solution  to  the  Lagrangian  relaxation  using 
€j  =  e  for  all  j  where  e  <  emax,  satisfies 

V(x)  >  V*  —  A/n  —  (nm  +  l)e  —  7 

with  probability  at  leaset  1  —  8,  where  V (x)  is  the  primal  value  of  the  sampled  solution  x,  m  is  the 
number  of  resources,  7  is  the  tolerance  to  which  the  optimization  algorithm  was  run,  and 

^  =  e~ne2/2U2\V*\2  _|_  e~ne2 /2U2\y*\2 . 

j 

Hence,  to  limit  the  probability  of  failure  to  obtain  a  feasible  solution  from  rounding  to  8,  we  can 
set  the  cushion  to  be  e  = 

Second,  we  can  repeat  the  randomized  rounding  process  until  we  obtain  a  feasible  solution,  since 
given  a  bounded  probability  of  failure  of  getting  a  feasible  solution,  the  expected  number  of  trials 
before  succeeding  is  constant. 

Last,  we  employ  a  reactive  planner  to  adjust  the  individual  plans  (either  centrally  or  by  coordinating 
the  agents)  to  satisfy  the  shared  constraints  in  order  to  ultimately  guarantee  feasibility  in  practice. 
This  reactive  planner  can  greatly  reduce  the  number  of  resampling  steps  in  practice,  since  we 
only  have  to  discover  a  plan  that  is  close  enough  to  feasible  that  the  reactive  planner  can  correct 
it. 


3.4  An  Accelerated  Gradient  Method  for  Factored  MDPs 

The  trick  of  optimizing  an  objective  augmented  with  a  strongly  convex  penalty  discussed  in 
Section  3.3.2  is  often  used  in  combination  with  accelerated  gradient  methods,  and  has  been 
successful  in  other  optimization  problems  (e.g.  [16]).  However,  in  the  case  of  factored  MDPs,  we 
face  an  additional  difficulty:  if  we  add  a  strongly  convex  penalty  such  as  the  quadratic  loss  to  the 
MDP  objective,  we  can  no  longer  use  fast  MDP  planning  algorithms  such  as  value  iteration.  If  we 
were  forced  to  back  off  to  general  convex  optimization  methods  such  as  quadratic  programming 
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when  solving  the  MDP  subproblems,  we  would  lose  much  of  the  benefit  of  the  accelerated 
gradient  algorithm:  our  iteration  count  would  be  lower,  but  each  iteration  would  be  much  more 
expensive.  To  address  this  issue,  we  propose  to  use  a  maximum  causal  entropy  penalty  [98],  a 
common  regularizer  for  MDPs  in  reinforcement  learning.  With  this  penalty,  we  can  still  use  a 
slight  modification  of  value  iteration  to  solve  each  individual  MDP.  While  causal  entropy  may 
not  be  strongly  convex  at  every  point  in  the  domain,  we  demonstrate  that  in  practice  it  can  still 
give  us  much  faster  overall  convergence. 


3.4.1  The  Augmented  Program 


Causal  entropy  is  defined  as 


J2H(Ait\\Sit), 


where 

H(Ait\\Sit)  =  -  Y  P{AiuSit)\ogP{Ait\Sit). 

Ait 

Augmented  with  causal  entropy,  our  objective  (3.2.16)  becomes 


p(4|“4A)  E«E  H^'  1 1**)  +  rJP(Au,  Su )  -  E  f?  <E  DaP^  5«)  -  6,) 

s.t.  Vi,  Ci, 

where  \  controls  the  contribution  of  the  causal  entropy  term  in  the  objective.  Then,  applying  the 
Lagrangian  relaxation,  our  goal  becomes  to  find 

n  ^  m 

V*  =  inf  max  Y  aH(Xi’  ^  ^ Xi  +  Et1^0  -  -  ao)  +  ~  E  DvVi)\  s-t  C*’ 

2=1  j= 1  2 

where,  as  before  we  define  vectors  Xi  =  P(Ait ,  Slt)  and  y,  =  P(Ait\Sit),  and  let  H(xj,  yt)  = 
Y,t'  H(Ait'\\Sitf). 

To  compute  the  subgradient,  again  we  must  solve  the  subproblems,  whose  objectives  now  contain 
the  causal  entropy  term  H(xi,yi).  Such  subproblems  can  be  solved  using  a  variant  of  value 
iteration  called  softmax  value  iteration  [97],  shown  in  Algorithm  4.  The  modification  from 
value  iteration  happens  in  the  last  line,  where  the  maximum  function  is  replaced  by  the  softmax 
function 


softmaxf  (/(:r))  =  logY1 


Then,  the  final  policy  output  by  the  algorithm  is  a  stochastic  policy: 

AQ(Su,Ait) 

P(Ait\Sit)  = 


e}Q(Sit’A  ’ 
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in  comparison  to  the  deterministic  policies  returned  by  value  iteration: 


f  1,  if  A,t  =  A*t 
P(Ait\Sit )  =  l 

[  0,  otherwise 

where  A*t  is  a  randomly  chosen  element  of  the  set  {arg  maxa  Q(Stt.  a)}. 


Algorithm  4  The  softmax  value  iteration  algorithm 
Input:  /3,  the  smoothing  parameter. 

Output:  V,  the  value  function. 

Vs,  V(s)  =  0 

for  t  4 —  1 , . . . ,  T 

•  Vs,  a,  Q(s ,  a)  =  R(s,  a)  +  V (s')P(s'|s,  a) 

•  Vs,  V(s)  =  softmaxf  (Q(s',  a)) 


Finally,  Algorithm  5  shows  the  resulting  FISTA  algorithm  for  factored  MDPs  with  hinge  loss 
penalties  (AGLR). 


Algorithm  5  The  accelerated  gradient  Lagrangian  relaxation  algorithm  for  factored  MDPs 
Input:  L,  a  Lipschitz  constant  of  VVg(A). 

Output:  x*,  policies.  \v  dual  values. 

Ao  4—  0 
ZA  4—  Ao 
do  4—  1 

for  1 4 —  1 ,  2 , . . .  T 


argmaxXi!3/i  ±H(xi,yi)  +  (n 
Vj,  A tj  4-  utj  -  \(bj  -  DijU* ) 

Vj,  A tj  4-  max(0,  min(o:i,  Xtj)) 

,  l+\/l+4d2 

dt+ i  < - 2 - 

ut+ 1  4-  A*  +  (^-)(At  -  Af_i) 


\Di)Tyi  II  solved  by  softmax  value  iteration 


Errors  in  Solution 

Augmenting  a  factored  MDP  with  causal  entropy  to  use  with  FISTA  introduces  two  kinds  of 
errors  in  the  computed  value  of  V* . 

First  stems  from  the  fact  that,  unfortunately,  causal  entropy  may  not  be  strongly  convex  in 
P(Ait,  Su )  everywhere  in  the  domain.  While  subgradient  method  is  robust  to  the  fact,  conver¬ 
gence  guarantees  for  FISTA  require  a  strongly  convex  component  in  the  objective.  Nevertheless, 
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in  practice  we  can  detect  convergence  by  examining  the  primal-dual  gap,  and  in  our  experiments 
FISTA  often  achieves  a  small  primal-dual  gap. 

The  second  source  of  error  comes  simply  from  the  fact  that  we  are  no  longer  optimizing  the 
original  objective  due  to  augmentation  by  causal  entropy.  Naturally,  the  error  depends  on  the 
strength  of  the  penalty  term  controlled  by  /?,  and  we  can  show  that  the  suboptimality  of  the 
solution  obtained  by  maximizing  the  causal  entropy  objective  is  bounded  by  (7#  log  \A\)//3, 
where  |A|  denotes  the  size  of  the  set  of  actions  in  the  MDP,  and  TH  the  time  horizon. 


3.4.2  Experiments 

We  study  the  convergence  rate  and  quality  of  SLR  and  AGLR  on  a  path  planning  domain  of 
X-shaped  bridges  and  interchanges  described  in  Section  3.1. 


Empirical  Quality  of  Solution 

First  we  examine  the  effectness  of  Lagrangian  relaxation  with  approximated  shared  constraints. 
In  particular,  we  look  at  how  plans  sampled  from  a  solution  to  the  Lagrangian  relaxation  fare  with 
respect  to  the  original  hard  shared  constraints. 

Figure  3.2  reveals  the  reduction  the  number  of  collisions  in  plans  sampled  from  solutions  to  the 
Lagrangian  relaxation  with  shared  constraints  compared  with  those  from  solutions  to  the  baseline 
of  not  using  shared  constraints  (which  is  equivalent  to  running  one  iteration  of  the  Lagrangian 
relaxation  algorithm  with  resource  prices  set  to  zero).  For  both  methods,  we  optimized  the  causal 
entropy  augmented  programs  via  AGLR.  Each  bar  represents  a  problem  setting,  corresponding 
to  a  specific  value  of  two  parameters  that  contribute  to  the  problem’s  difficulty:  the  rate  of  entry 
onto  the  bridge  and  how  likely  a  vehicle  must  change  lanes  to  reach  its  goal.  For  each  bar,  the 
number  of  collisions  is  averaged  over  4  problem  instances  of  each  setting,  where  for  each  problem 
instance  we  picked  the  best  plan  out  of  100  iterations  of  the  randomized  rounding  method.  The 
percentages  were  calculated  using  the  best  setting  of  the  smoothing  parameter  f3  for  each  method 
and  each  problem. 

Bars  above  the  zero  line  represent  problems  where  incorporating  shared  constraints  into  the 
problem  was  beneficial.  While  in  most  cases  shared  constraints  reduce  the  number  of  collisions,  it 
can  also  increase  the  number  of  collisions.  We  posit  that  such  increases  are  due  to  randomization 
playing  a  greater  role  when  shared  constraints  must  be  satisfied,  as  through  our  Lagrangian 
relaxation  the  original  hard  shared  constraints  must  be  satisified  in  expectation.  Randomization 
can  cause  suboptimality;  for  example,  when  a  problem  requires  little  coordination  and  a  nearly- 
deterministic  solution  from  not  using  shared  constraints  is  close  to  optimum  and  deviation  from 
such  causes  suboptimality. 

We  note  that  even  when  the  number  of  violated  constraints  is  reduced,  we  may  not  always  find  a 
violation-free  solution  by  solving  the  Lagrangian  relaxation.  Two  main  sources  of  error  exist. 
First  and  the  most  probable  cause  is  that  our  problem  domain  may  not  be  large  enough  for  the 
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Figure  3.2:  A  comparison  of  the  number  of  collisions  in  sampled  plans  for  20-agent  X-shaped 
bridge  traffic  management  problem  instances.  The  bars  shown  are  percentage  reductions  in 
collision  counts  when  employing  shared  constraints. 


approximation  guarantee  to  be  effective,  yielding  a  high  failure  rate  for  rounding.  Second,  the 
MDP  representation  models  the  states  as  non-trivially  large  blocks  of  space  when  in  reality  the 
space  is  continuous,  which  can  handicap  the  agents.  In  particular,  since  agents  are  restricted  to  the 
discretized  positions  and  the  small  number  of  actions  to  move  between  them,  collisions  may  be 
more  likely  than  if  they  were  given  the  full  freedom  of  the  continuous  space.  This  discretization 
problem  helps  explain  why  it  is  possible  to  correct  for  the  collisions  using  local  controllers,  even 
when  the  deliberative  plan  contains  collisions. 

Hence,  in  the  presence  of  such  violations,  we  can  employ  a  local,  reactive  planner  to  correct 
the  “nearly-good”  solutions  returned  by  optimizing  the  Lagrangian  relaxation.  For  the  traffic 
management  problem,  we  implemented  a  reactive  planner  (as  described  in  Section  3.1)  at  each 
vehicle  so  that  the  vehicles  try  to  keep  a  small  distance  from  each  other  while  following  the  plans 
returned  by  our  algorithm. 


Convergence  Experiments 

The  convergence  experiment  results  show  how  the  choice  of  the  smoothing  parameter  /?  can 
balance  the  trade  off  between  convergence  and  optimality  to  achieve  accelerated  convergence 
while  maintaining  solution  quality  comparable  to  directly  optimizing  the  linear  reward  function 
(despite  the  bias  introduced  by  the  causal  entropy  term). 

Figure  3.3(a)  shows  the  convergence  of  AGLR  with  different  values  of  smoothing  parameter  /3, 
in  terms  of  the  linear  primal  values  (without  the  causal  entropy  term),  computed  at  the  solution 
to  the  augmented  objective  at  each  iteration.  Hence  these  values  correspond  to  the  solution 
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-32.66 


Lower  bound  on  linear  reward  (at  current  x) 


Upper  bound  on  linear  reward  (at  lin-opt  x) 


(a)  Convergence  of  lower  bound  (primal)  (b)  Convergence  of  upper  bound  (dual) 


Figure  3.3:  Convergence  comparisons.  The  blue  lines  indicate  an  approximate  optimal  solution, 
as  it  upper  bounds  the  lower  bound  curves  in  (a)  and  lower  bounds  the  upper  bound  curves  in  (b). 


available  to  be  used  at  each  iteration,  and  also  serve  as  lower  bounds  to  the  optimal  solution.  The 
lower  bounds  demonstrate  that  in  all  cases  optimizing  the  augmented  objective  leads  to  improved 
convergence  albeit  at  the  expense  of  optimality  with  respect  to  the  unaugmented  objective.  A 
lower  value  of  j3  means  the  objective  has  a  greater  weight  on  causal  entropy,  and  unsurprisingly, 
runs  with  lower  f3  values  converge  more  quickly,  but  to  worse  linear  objective  values.  However, 
the  subgradient  Lagrangian  relaxation  directly  optimizing  the  linear  objective  function  (the  solid 
red  line),  while  eventually  reaching  the  optimum  solution  (whereas  running  FISTA  with  a  large 
/ 3  may  not),  converges  extremely  slowly;  at  iteration  1000  it  achieves  a  lower  bound  of  -38.543, 
which  FISTA  with  (3  =  4  reaches  at  iteration  20. 

In  Figure  3.3(b)  we  plot  upper  bounds  of  the  optimal  primal  value  of  the  same  AGLR  runs;  at 
each  iteration,  we  plot  Vo  (A),  i.e.,  the  dual  value  of  the  linear  objective  at  the  current  A.  The  plot 
shows  the  trade-off  between  convergence  speed  and  quality  very  similar  to  that  in  the  lower  bound 
plot.  Moreover,  it  shows  that  with  a  sufficiently  high  value  of  f3,  we  can  approach  the  optimum 
very  quickly;  the  blue  horizontal  line  in  both  Figures  3.3(a)-3.3(b)  represent  the  same  value  that  is 
bounded  and  closely  approached  by  the  /3  =  32  curves  in  both  plots.  This  shows  that  with  a  high 
enough  value  of  /3,  we  may  recover  the  optimal  solution  with  substantially  fewer  iterations  than 
subgradient  method. 

In  this  example  setting,  (3  —  32  appears  sufficient  to  achieve  a  good  solution  and  very  fast 
convergence.  However,  it  would  be  interesting  to  see  if  any  gain  would  come  from  adaptively 
tuning  (3  over  the  gradient  descent  iterations  to  achieve  even  faster  convergence  while  maintaining 
good  solution  quality. 

Finally,  the  experiments  demonstrate  the  robustness  of  Lagrangian  relaxation;  even  though  it  was 
optimized  using  a  potentially  non-strongly-convex  objective,  we  were  able  to  find  a  solution 
to  the  Lagrangian  relaxation  with  a  very  small  duality-gap  with  respect  to  the  non-augmented 
objective,  which  corresponds  to  finding  a  near-optimal  solution. 
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3.5  Conclusion 


In  this  chapter,  we  expanded  upon  the  mathematical  programming  framework  for  distributed 
multi-agent  planning  introduced  in  Chapter  2  via  Lagrangian  relaxation  for  more  efficient  planning 
in  time-constrained  domains.  We  argued  that  for  a  large-scale  problem  where  each  agent  is  only  a 
small  part  of  the  problem,  we  can  find  near-optimal  solutions  to  the  original  unrelaxed  problem 
via  Lagrangian  relaxation.  We  described  how  to  optimize  the  Lagrangian  relaxation  with  different 
optimization  methods,  the  subgradient  method  and  an  accelerated  gradient  method,  FISTA.  In 
particular,  we  gave  an  augmented  accelerated  gradient  method  for  factored  MDPs  using  causal 
entropy  regularization.  Despite  the  potential  errors  caused  by  altering  the  objective  function,  our 
convergence  results  show  1.  that  the  augmented  program  can  achieve  near-optimality  in  practice 
with  little  work  in  tuning  the  smoothing  parameter,  and  2.  that  it  does  so  at  a  superior  convergence 
rate  in  comparison  to  the  standard  subgradient  algorithm  on  the  original  objective. 

While  we  have  established  the  groundwork  for  efficiently  optimizing  via  Lagrangian  relaxation  on 
an  instance  of  the  traffic  management  problem,  we  believe  that  many  domains  can  benefit  from  the 
relaxation  and  the  accelerated  gradient  method.  It  would  be  interesting  to  study  the  algorithm  on 
larger  and  more  complex  problems,  and  to  extend  the  algorithm  to  adaptively  control  the  weight 
of  the  causal  entropy  smoothing  term  to  further  speed  up  convergence. 
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Chapter  4 

Planning  under  Uncertainty 


Planning  under  uncertainty  is  often  a  non-trivial  challenge  in  auction-based  distributed  planning 
frameworks.  An  advantage  of  the  MILP  representation  for  distributed  multi-agent  planning  is  its 
natural  extension  for  planning  under  uncertainty,  stochastic  programming  (e.g.  [12]),  a  widely  used 
mechanism,  particularly  in  operations  research,  for  planning  in  stages,  as  information  is  revealed 
sequentially.  The  idea  is  to  plan  for  different  scenarios,  which  are  possible  futures  sampled 
according  to  a  distribution  of  future  random  events  such  as  obstruction  of  a  road  in  path  planning 
or  the  locations  of  future  tasks  in  UAV  task  allocation  (learned,  assumed,  or  known  otherwise)1. 
Stochastic  programming  provides  us  with  a  principled  planning  framework  under  uncertainty; 
hence,  given  a  deterministic  problem  written  as  a  MILP,  we  can  easily  incorporate  uncertainty 
without  having  to  devise  a  domain-specific  recipe  for  handling  uncertainty.  Furthermore,  the 
resulting  stochastic  program  can  be  reduced  to  an  ordinary  MILP.  While  this  MILP  may  be  larger 
than  a  similar  deterministic  program,  it  can  be  solved  using  methods  discussed  in  Chapters  2 
and  3  for  MILPs  and  their  Lagrangian  relaxations. 

In  this  chapter,  we  extend  the  Lagrangian  Relaxation  framework  introduced  in  Chapter  3  to 
incorporate  uncertainty  via  stochastic  programming.  In  particular,  we  focus  on  factored  MDPs, 
where  uncertainty,  in  the  form  of  sampled  scenarios,  can  be  naturally  encoded  in  the  individual 
MDPs  and  solved  in  a  distributed  fashion.  The  number  of  resources  (i.e.,  constraints)  in  the 
resulting  program  is  linear  in  the  number  of  scenarios,  as  we  must  satisfy  the  constraints  in  each 
scenario  considered. 

To  better  understand  the  role  of  incorporating  uncertainty  in  planning,  we  study  supply  chain 
management  as  our  motivating  application.  Supply  chain  management  is  an  important  problem  in 
production  and  delivery  of  goods,  where  many  producers  must  cooperate  to  maximize  efficiency, 
and  therefore  profit.  In  particular,  we  examine  the  phenomenon  of  the  bullwhip  effect,  where 
changes  in  the  economic  environment,  e.g.,  the  consumer  demand,  create  a  whiplash  effect 

'in  particular,  the  future  events  considered  for  sampling  are  random  from  the  perspective  of  the  agents,  and  are 
assumed  to  be  independent  of  their  actions.  The  future  as  a  whole  is  affected  both  by  these  random  events  and  the 
agents’  actions,  and  so  the  future  is  (of  course)  highly  dependent  on  the  agents’  actions.  However,  our  methods  deal 
with  the  influence  of  agents’  actions  on  the  future  and  thus  on  each  other  by  having  the  agents  jointly  plan,  with 
enough  knowledge  about  other  agents’  plans. 
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through  the  supply  chain,  resulting  in  large  fluctuations  in  production  along  the  chain,  which  are 
exaggerated  further  as  we  go  away  from  the  consumer  demand.  The  bullwhip  effect  is  a  result  of 
agents  planning  without  consideration  for  a  possibly  changing  economy,  in  particular  failing  to 
plan  for  others’  reactions  to  the  changes,  and  shows  that  such  a  supply  chain  is  slow  to  respond 
and  stabilize  to  the  demand,  leading  to  economic  inefficiencies.  We  show  in  our  experiments 
that  using  stochastic  programming  we  can  lessen  the  bullwhip  effect,  as  agents  can  prepare  for 
changing  economies. 

In  the  following,  we  first  introduce  our  main  domain  of  supply  chain  management  and  give  an 
example  of  the  bullwhip  effect.  We  then  provide  an  overview  of  stochastic  programming  on 
the  Lagrangian  relaxation,  and  show  our  novel  stochastic  programming  formulation  of  factored 
MDPs,  which  can  be  efficiently  constructed  from  the  original  factored  MDPs.  Finally  we  give 
experimental  results  demonstrating  that  stochastic  programming  can  help  ameliorate  the  bullwhip 
effect,  yielding  more  efficient  plans  than  reactively  trying  to  fix  plans  that  were  generated  based 
on  the  assumption  of  a  deterministic  world. 


4.1  Supply  Chain  Management  and  the  Bullwhip  Effect 

Supply  chain  management  tackles  the  problem  of  efficient  production  and  distribution  of  goods, 
determining  a  variety  of  actions  from  inventory  management  and  production  to  transportation  and 
returns  processing.  An  artificial  example  supply  chain  is  shown  in  Figure  4.1.  A  supply  chain  may 
include  a  number  of  different  organizations,  each  of  which  may  be  composed  of  many  divisions. 
The  goal  is  for  each  entity,  whether  an  organization  or  a  smaller  unit  within,  to  produce  its  goods 
efficiently,  while  meeting  the  contracts  promised  between  it  and  its  adjoining  entities. 


Figure  4.1:  An  example  supply  chain. 

While  supply  chain  management  often  includes  competitive  factors  as  businesses  compete  for 
best  contracts,  here  we  assume  a  supply  chain  comprising  cooperative  agents  wishing  to  produce 
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and  transfer  goods  at  socially  optimal  costs.  For  example,  it  is  common  for  a  vertically-integrated 
company  to  need  to  manage  different  production  divisions  within.  While  such  divisions  will 
cooperate  for  the  overall  good  of  the  company,  they  may  be  segregated  (physically  or  managerially) 
enough  to  warrant  distributed  planning.  The  goal  then  is  to  efficiently  run  the  supply  chain  to 
maximize  profit  given  the  current  consumer  demand. 

For  simplicity,  we  assume  that  each  agent  produces  one  kind  of  goods,  and  is  locally  formulated 
as  a  Markov  decision  process  (MDP),  where  the  state  encodes  its  inventory  levels  and  the  actions 
represent  the  quantity  of  production ,  buys,  and  sells  at  each  time  step.  Also  defined  for  each  agent 
is  the  production  cost  curve,  and  the  consumer-facing  agent  is  also  aware  of  the  consumer  demand 
curve  (example  curves  are  shown  in  Figure  4.2). 

The  joint  state  is  then  the  tuple  of  individual  MDP  states,  and  the  joint  action  is  the  tuple  of 
individual  MDP  actions.  Our  constraints  limit  the  joint  actions:  we  must  match  the  selling  and 
buying  quantities  of  adjoining  agents.  Our  planning  algorithm  learns  prices  corresponding  to  each 
of  these  constraints;  we  hope  to  learn  prices  that  result  in  a  socially-optimal  joint  plan. 

We  encode  the  uncertainty  in  the  consumer  demand  in  the  form  of  the  world  economic  state, 
which  is  either  a  boom  or  a  recession.  We  assume  that  at  each  time  step  (a  production  cycle), 
the  world  may  change  its  state  with  a  small  probability  (a  biased  coin  flip).  Each  economy  is 
defined  by  the  production  cost  and  demand  curves  (e.g.  see  Figure  4.2).  In  an  economic  boom, 
the  consumer  demand  curve  is  higher,  leading  to  a  higher  equilibrium  production  point  (where 
profit  is  maximized). 


4.1.1  The  Bullwhip  Effect 

If  a  supply  chain  does  not  prepare  itself  for  a  large  change  in  the  economy  such  as  a  change  from 
a  recession  to  boom,  the  resulting  inefficiencies  manifest  themselves  in  the  bullwhip  effect  [26]. 
Figure  4.3  demonstrates  the  bullwhip  effect  on  a  linear  chain  with  four  agents. 

In  this  example,  before  time  t  =  1,  all  agents  are  producing  for  the  equilibrium  consumer  demand 
level  (1  unit)  in  recession.  At  t  =  1  the  economy  changes  to  boom,  and  the  consumer  demand 
changes  the  optimal  equilibrium  production  levels  to  2  units.  Agent  4,  who  produces  the  final 
consumer  product,  sells  2  units  to  settle  into  the  equilibrium,  at  the  expense  of  its  inventory2, 
then  must  order  an  extra  unit  of  material  from  agent  3  and  produce  an  extra  unit  to  replenish  its 
inventory.  Agent  3,  after  selling  extra  units  from  its  inventory  at  t  =  1  must  produce  even  more 
extra  units  than  agent  4  in  order  to  replenish  agent  4  and  its  own  inventories.  The  effect  propagates 
up  the  supply  chain  (shown  by  the  arrow). 

In  our  example  above,  agents  “overreact”  to  the  changes  in  the  demand  at  each  edge  in  order  to  refill 
their  inventories.  However,  the  bullwhip  effect  can  be  even  more  exaggerated  if  agents,  upon  seeing 
the  increased  demand,  anticipate  an  even  higher  demand  in  the  future  and  overproduce.  Hence,  as 

2In  this  example,  a  small  cost  is  applied  to  low  inventory  (for  both  raw  materials  and  produced  goods)  to  encourage 
a  non-zero  inventory  level;  such  a  buffer  is  often  used  to  cushion  the  effects  of  sudden  changes  in  demand  and  to  ease 
inventory  management. 
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Cost  &  reward  curves,  recession  Cost  &  reward  curves,  boom 


Profit  =  Revenue  -  Cost 


Profit  =  Revenue  -  Cost 


Figure  4.2:  Example  production  cost  and  demand  curves  for  each  economic  state.  Here  we  have 
strongly  convex  production  curves  encoding  overtime  labor  costs  and  capital  investments  required 
to  expand  production,  and  concave  demand  curves  with  diminishing  returns. 


agents  up  the  supply  chain  try  to  meet  the  new  demand,  the  deviation  from  the  equilibrium  demand 
is  amplified3,  causing  inefficient  production  (as  the  production  costs  are  convex)  and  instability 
in  inventory,  and  possibly  run  out  of  stock.  In  Section  4.3,  we  demonstrate  how  stochastic 
programming  can  ameliorate  the  bullwhip  effect  and  lead  to  more  efficient  production. 


4.2  Stochastic  Programming 


Stochastic  programming  incorporates  uncertainty  into  planning  by  modeling  it  as  a  distribution  of 
future  random  events4.  Let  v  be  the  random  variable  for  the  sequence  of  future  events,  and  p(v) 
denote  the  probability  distribution  for  v.  Stochastic  programming  simply  minimizes  the  expected 
cost  with  respect  to  p(v). 

In  particular,  a  /e-stage  stochastic  program  with  recourse  models  the  world  as  k  stages  of  informa- 
3The  amplification  gives  rise  to  the  name  bullwhip  effect. 

4We  do  not  include  our  agents’  actions  as  part  of  future  events,  only  global  events  random  to  all  agents.  The 
uncertainty  in  other  agents’  actions  from  an  individual  agent’s  perspective  is  handled  by  coordinating  through  resource 
prices  to  obtain  a  joint  solution. 
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Agent  1 


Agent  2 


Agent  3 


Agent  4 


Figure  4.3:  An  example  of  the  bullwhip  effect  on  a  four-agent  supply  chain.  The  red  arrow  shows 
the  rise  in  demand  propagating  through  the  chain  when  the  economy  changes  to  boom  from 
recession  and  the  consumer  demand  increases. 


tion  revelation.  At  stage  t,  some  information  is  revealed  about  the  random  events,  and  we  follow 
the  plans  made  specifically  for  the  world  corresponding  to  the  revealed  information,  where  the 
plans  are  represented  by  the  t-th  stage  recourse  variables.  Here  we  present  the  2-stage  stochastic 
program  for  simplicity;  however,  the  2-stage  representation  is  sufficient  for  factored  MDPs,  as 
we  will  show  in  Section  4.2.1  that  for  factored  MDPs  satisfying  some  assumptions  (e.g.,  on  the 
policy  class),  we  can  encode  arbitrarily  many  stages  in  2  stages. 

In  contrast  to  a  MILP  (3.2.1)-(3.2.3),  we  can  write  a  2-stage  stochastic  program  as: 

max  £ct  Vi  +  Ep{v)[Q(yhv)]  (4.2.1) 

^  i=l:n 

s.t.  D iVi  ^  b  (4-2.2) 

i=l:n 

VitCi ,  Vi  =  l,...,n  (4.2.3) 

where  y  denotes  the  first- stage  variables  and  Q(yi,v )  is  the  optimal  value  for  agent  i  of  the 
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second- stage  problem 


max  q(v)TXi 

(4.2.4) 

i=l:n 

s.t.  D\xi  <  b' 

(4.2.5) 

i=l:n 

T(v)y  +  W (v)x  =  h(v),  Vi  =  1, . . 

• ) n 

(4.2.6) 

m 

< 

ii 

.,n. 

(4.2.7) 

Here  Xi  are  the  recourse  variables  for  agent  i  under  the  future  events  represented  by  v,  where 
q(v),  T(v),W (v).  h(v)  specify  the  second-stage  problem  for  v.  For  multi-agent  planning  prob¬ 
lems,  which  are  largely  factored,  T(v)  and  W{v)  are  mostly  block-diagonal,  with  a  small  number 
of  rows  for  shared  constraints. 

Certain  distributions  p  allow  a  closed-form,  or  a  compact  representation  of  the  objective  (4.2.1). 
For  example,  for  the  supply  chain  mangement  problem,  we  assumed  a  sequence  of  Bernoulli  coin 
flips,  which  seems  simple  enough  to  yield  a  compact  representation  for  the  stochastic  program. 
Represented  as  an  MDP,  the  individual  problem  size  needs  to  be  multiplied  only  by  the  number 
of  possible  world  states,  Sw  | ,  to  be  able  to  represent  the  cost  function  appropriately.  However, 
the  number  of  shared  constraints  (4.2.2)  increases  exponentially  in  the  number  of  timesteps  for 
which  we  plan,  T,  as  the  number  of  possible  futures  v  is  |<SWji_1.  For  factored  MDPs,  the  fact 
that  we  must  keep  track  of  the  constraints  for  each  possible  future  implies  that  each  agent’s  MDP 
state  space  must  also  be  able  to  distinguish  between  the  different  futures,  and  thus  the  state  space 
may  be  0(|SW|t_1)  larger  than  the  deterministic  version  of  the  MDP.  Furturemore,  for  many 
distributions  and  cost  functions,  there  is  no  closed-form  expression  for  the  expected  cost  objective, 
or  computing  it  exactly  is  intractable. 

To  make  the  problem  more  tractable,  we  approximate  the  exponential  space  of  possible  futures  by 
sampling  from  the  space.  Here  we  refer  to  resulting  samples  as  scenarios.  Given  a  set  of  scenarios 
Z,  we  can  simply  rewrite  (4.2. 1)— (4.2.3)  as  a  scenario-based  program: 


max  V  cTyi  +  Y]  P(()q(()TxiC 

y.x  z z — * 

i=l:n  (eZ 

s.,.  (£)  =  (£), 

i=l:n  \  s/  v  / 

T(Oyi-fw{<;)xi  =  h(0, 

Hi  G  Ci,  Xi£  G  C', 


(4.2.8) 

VC 

(4.2.9) 

VC,i  =  l,.. 

•  ,n 

(4.2.10) 

,,n 

(4.2.11) 

where  x%  denotes  the  plan  for  agent  i  under  scenario  (.  Here,  if  the  set  Z  contains  every  possible 
scenario  (with  the  frequency  matching  the  distribution  p),  the  scenario-based  program  is  in  fact 
equivalent  to  the  stochastic  program  (4.2.1)-(4.2.3).  The  hope  is  that  a  tractable  number  of 
scenarios  will  suffice  to  compute  a  close  approximation  to  the  true  optimal  first-stage  decision 
variables  y  and  the  optimal  value  of  (4.2.1).  (And  in  fact  there  are  many  results,  both  theoretical 
and  experimental,  which  support  this  hope,  e.g.  [45,  46].) 
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4.2.1  Stochastic  Programming  for  Factored  MDPs 

In  Chapter  3,  we  introduced  factored  MDPs,  which  are  represented  by  “small”  MDPs  specifying 
individual-agent  planning  problems,  in  addition  to  a  set  of  shared  constraints.  In  particular,  we 
discussed  deterministic  factored  MDPs  where  the  agents’  transition  probabilities,  initial  state  dis¬ 
tributions,  policies,  and  state-action  probabilities  are  all  deterministic,  i.e.,  0  or  1.  Here  we  extend 
deterministic  factored  MDPs  to  incorporate  uncertainty  using  stochastic  programming. 

Following  the  notation  in  Section  3.2,  define  Ait,  Stf  to  be  the  random  variables  for  the  action 
and  state  of  agent  i  at  time  t.  Given  the  reward  vector  r*,  the  initial  state  distribution  P(Sl\  ),  and 
the  transition  probabilities  P(Sit\Aitt_i,  Sijt-i),  the  goal  of  MDP  planning  is  to  find  an  optimal 
policy  P(Ai\Si )  =  [P(A,i |,SVi). . . . ,  P(AlT\StT)]  for  a  planning  horizon  T  that  maximizes  the 
reward  rJP(Ai,  Si).  In  addition,  our  factored  MDPs  define  the  interaction  between  agents  as 
shared  constraints  in  the  form  of  piece- wise  linear  functions  fj  for  each  resource  j. 

In  the  stochastic  program  generalizing  the  deterministic  factored  MDP  (3.2.16),  we  let  the  first- 
stage  optimization  variables  be  the  randomness-dependent  policies,  denoted  P{Ai\Si:  V )  for  agent 
i  and  the  random  event  V.  Following  the  same  convention  for  the  deterministic  MDP,  P(A,|,S'(r  V ) 
is  a  vector  indexed  by  time.  However,  here  we  must  specify  the  element  at  t  as  P(Alt\Stt.  Vt ) , 
corresponding  to  the  policy  at  time  t  given  Vt,  the  information  available  up  to  time  t,  since  the 
policy  decisions  at  time  t  should  not  depend  on  information  from  the  future.  Finally,  we  write 
P(A\S,  V )  =  [P(Ai  | V) . . .  P(An\Sn,  V )]  for  the  vector  of  all  agents’  policies. 

Now  we  are  ready  to  write  the  stochastic  program: 

max  Ep[Q(P(A\S,V),v)}  (4.2.12) 

P(A\S,V) 

s.t.  ^2P(At\Sit,Vt)  =  1,  VSu,  Vt,  t,  i  (4.2.13) 

A-it 

P(Ait\Sit,  Vt)  e  {0, 1},  Vf,  Ait,  Sit,  Vt,  i  (4.2.14) 

with  the  second- stage  problem  for  a  specific  v: 

m 

^  ri{vfP(Au  Si\v)  -  ^  DijP(Ai:  5i|v)  -  bj)  (4.2.15) 

i=l:n  j= 1  i 

P(An,  Sn\v)  =  P(An\Sii,v)P(Sii\v),  VAil?  Sn,  i  (4.2.16) 

P(Ait,  Su\v)  =  P(Ait\Sit,  v)  E  E  P(Sit  |  Aiit_! ,  ,  v)P(Ai,t_1 1 ,  v) , 

VAit,Sit,i,t>  1  (4.2.17) 

J2  P(Ait,Sit\v)  =  1,  Vt,i  (4.2.18) 

A-HiSit 

where  the  second-stage  variables  are  the  stage-action  probabilities  P(Ait,  Su\v),  and 
Ti(v),  P(Sn\v),  and  P(Sit\Ai>t-i,Sijt-i,v)  are  constants.  In  particular,  probabilities 
P(S,jt|Ajit_i,  Sij-i,  v)  and  P(Sn\v)  are  deterministic. 


max 


s.t. 
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It  is  important  to  emphasize  that  the  sub-vector  of  P(Ai\Si,V)  for  the  policy  at  time  t, 
P(Ait\Sit,  Vt),  depends  only  on  the  information  available  up  to  time  t,  which  is  represented 
by  the  random  variable  Vt.  In  particular,  consider  writing  the  scenario-based  program  for 
(4.2.12)-(4.2.18).  The  value  of  Vt  may  not  tell  us  exactly  which  scenario  the  world  is  in,  since  a 
scenario  is  defined  by  the  value  of  random  events  through  the  planning  horizon  T.  Instead,  it 
can  tell  us  which  set  of  scenarios  are  compatible  with  the  information  given  so  far.  Hence  the 
scenarios  in  such  a  set  are  indistinguishable  from  each  other  given  what  is  currently  known,  and 
the  set  is  an  information  set.  Since  at  each  time  step  t  we  may  not  know  the  exact  scenario  we  are 
in,  but  we  do  know  which  information  set  is  active,  our  policies  cannot  be  defined  for  individual 
scenarios,  but  for  the  active  information  set  given  the  value  of  Vt. 

Let  1^  denote  the  vector  of  information  sets  scenario  £  belogs  to  over  time  such  that  I^t  is  the 
information  set  £  belongs  to  at  time  t.  Then  we  can  write  the  scenario  form  of  (4.2.12)-(4.2.18) 
given  the  set  of  scenarios  Z  as 

m 

E  Y.lPi-Or(0T P(A,  Si|/C)]  -  E  E  /“'<E  Dvp(A,  Si|/<)  -  b,)  (4.2,19) 

i= \'.n  C^Z  j— 1  i 

J2P(Ait\Sit,Ict )  =  1,  VSit,t,(,i  (4.2.20) 

An 

P(Ait\Sit,Ict )  G  {0,1},  VAit,Sit,t,(,i  (4.2.21) 

P(AiuSa\Ic)  =  P(Ai\Sn,I()P(Sii\k)s  VAit,SitX,i  (4.2.22) 

P(Ait,Sit\I<)  =  P(Ait\Sit,I<)  E  E  PiSulA^  i,  SM_!,  I<)P(Aitt-i\Si,t-1,  /f), 

&i,t  —  1  Ai^t—  1 

VAit,Sit,t>  l,C,i  (4.2.23) 

£  P(Ait,Sit\Ic)  =  l,  (4.2.24) 

An, Sn 

Here  we  note  that  the  shared  constraints  must  be  satisfied  in  every  scenario,  not  in  expectation 
over  all  scenarios.  Also,  the  individual  constraints  (4.2.20)-(4.2.24)  defined  over  (  are  redundant, 
since  the  same  information  set  appears  in  multiple  scenarios.  We  show  below  in  Section  4.2.2 
how  to  build  an  efficient  form  of  the  factored  MDP  that  reinforces  both  the  individual  constraints 
on  information  sets  and  the  shared  constraints  for  each  scenario. 

Note  that  the  scenario-based  program  here  is  not  technically  a  MILP,  since  constraints  (4.2.23) 
are  not  linear.  However,  since  P(Sit\Aijt-i,Sitt~i,v)  and  P(St\  |c)  are  binary,  we  can  obtain 
an  equivalent  MILP  by  writing  each  of  the  constraints  as  an  equivalent  set  of  linear  constraints. 
Here  we  leave  them  in  the  more  intuitive  form  for  readability.  In  contrast,  constraints  (4.2.17) 
in  the  second  stage  problems  for  the  stochastic  program  are  linear,  since  the  first- stage  variables 
P(A\S,  V )  are  considered  fixed  in  the  second  stage  problem. 

Again,  the  number  of  scenarios  in  Z  affects  the  quality  of  the  scenario-based  approximation.  As 
we  sample  more  scenarios,  we  expect  convergence  to  the  true  distribution,  and  thus  to  the  true 
optimal  solution.  In  particular,  the  number  of  scenarios  limits  the  complexity  class  of  the  policies, 
since  too  few  samples  can  lead  to  overfitting  the  policies.  While  in  this  thesis  we  do  not  address 


max 

y,x 

s.t. 
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the  convergence  rate  (i.e.,  the  number  of  samples  required  to  obtain  a  reasonable  answer  given  a 
class  of  policies),  we  believe  we  can  bound  our  convergence  rate.  One  possibility  would  be  to  use 
a  generalized  VC-dimension  argument  similar  to  that  given  by  Kearns  et  al.  [46]  for  planning  in 
single-agent  MDPs  by  sampling.  Such  a  theory  is  likely  to  suggest  a  large  number  of  samples  for 
small  errors,  however  in  our  experiments  we  find  that  a  rather  small  number  of  scenarios,  10  to 
50,  is  sufficient  (Section  4.3). 

Now  let  denote  the  subset  of  (4.2.20)-(4.2.24)  pertaining  to  agent  i  and  scenario 
c,  and  write  =  [P(Atl,  . . .  P(AlTl  StT\Iffi  for  state-action  probabilities  and 

=  [P(An\Sn,  P(Air\SiT,  /(_)]  f°r  policies  under  scenario  (.  Then  the  Lagrangian 
relaxation  of  the  scenario-based  stochastic  program  for  factored  MDPs  becomes: 

n  m 

inf  max  J^[P(C)  ^  rj(xi<:]  +  f)  -  Aif  (  D%]xK  -  fy)]] 

x’y  c ez  i= i  Cez  j= i  i-i:» 

s.t.  V(z,0,ac,  (4.2.25) 

where  A ^  now  corresponds  to  the  price  of  resource  j  under  scenario  (\ 

The  derivation  presented  here  was  based  on  the  definition  of  factored  MDPs  introduced  in 
Chapter  3,  where  state  transitions  for  each  agent  depend  only  on  its  own  previous  state  and 
action,  and  the  interaction  between  agents  is  captured  solely  through  the  shared  constraints. 
However,  our  framework  for  factored  MDPs  is  also  amenable  to  general  factored  MDPs 
(e.g.  [32]),  where  the  interaction  between  agents  is  captured  by  the  joint  transition  probabilities 
P(Si,t+i\Ait, . . . ,  Ant ,  Sit,  •  •  • ,  Snt).  Given  such  state  transitions,  we  can  turn  them  into  shared 
constraints  in  our  MILP  formulation  by  translating  them  into  logical  formulas  and  introducing 
auxilary  variables  when  needed  (for  example  to  enforce  linearity  in  the  final  constraints).  The 
joint  transition  probabilities  will  translate  to  the  matrices  T{v)  and  W (v)  in  (4.2.6),  and  hence  the 
level  of  coupling  between  agents  will  manifest  in  the  “blockiness”  of  the  matrices. 

Hence,  our  framework  is  quite  flexible,  encoding  many  different  types  of  dependences  that  would 
not  obviously  be  amenable  to  market-based  planning;  the  question  is  for  what  class  of  problems 
our  representation  is  efficient.  For  example,  if  the  joint  transition  probabilities  contain  very  tight 
coupling  over  many  agents,  representing  them  as  shared  constraints  may  take  many  auxilary 
variables,  increasing  the  state  space,  and  also  create  a  large  non-block-diagonal  section  in  T(v) 
and  W(v).  It  would  be  an  interesting  area  of  future  work  to  define  and  quantify  the  level  of 
coupling  between  agents  specified  in  joint  transition  probabilities  in  relation  to  the  efficiency  of 
representing  them  in  our  representation. 


4.2.2  Scenario-Based  Factored  MDPs 

Given  a  set  of  scenarios  Z,  we  must  find  a  plan  under  each  scenario,  hence  a  naive  implementation 
of  the  Lagrangian  relaxation  method  for  the  scenario-based  stochastic  program  requires  \Z\  MDP 
subproblems  for  each  agent.  In  addition,  we  must  enforce  further  constraints  to  ensure  consistency 
between  the  policies  for  the  scenarios  in  the  same  information  set.  We  can,  however,  collapse  the 
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\Z\  MDPs  into  a  single  MDP  using  the  information  sets;  a  state  in  the  combined  MDP  represents 
a  tuple  consisting  of  an  original  state  and  an  information  set,  the  set  of  scenarios  indistinguishable 
from  each  other  given  the  information  so  far.  The  resulting  representation  is  more  compact,  and 
leads  to  a  faster  running  time  for  value  iteration,  the  subproblem  solver,  and  thus  faster  global 
planning. 

Figure  4.4  illustrates  a  transformation  of  two  MDPs  into  a  collapsed  MDP.  Scenarios  are  sampled 
from  a  sequence  of  Bernoulli  coin  flips,  where  the  coin  flip  decides  whether  the  world  state 
switches.  The  example  represents  a  simplified  version  of  the  supply  chain  management  problem. 
We  can  think  of  “ws”,  the  world  state,  as  being  in  recession  or  boom,  “as”,  the  agent’s  state,  as 
its  inventory  level,  action  at  t  =  1  as  the  number  of  products  produced,  action  at  t  =  2  as  the 
number  of  products  sold.  Each  scenario  shown  has  different  world  states  over  time,  which  changes 
the  reward  from  selling  products  (see  r(s,  a  =  1)  at  t  =  2).  The  combined  MDP  captures  the 
information  sets  by  grouping  scenarios  whose  paths  up  to  time  t  at  the  state  are  identical.  Here  we 
assume  the  scenarios  are  uniformly  sampled  in  order  to  construct  the  transition  probabilities  in 
the  combined  MDP. 


Alternatively,  we  can  view  the  collapsed  MDP  as  a  cross  product  between  the  original  deterministic 
MDP  and  the  scenario  tree,  which  is  a  tree  MDP  representing  the  transitions  between  information 
sets  of  scenarios,  or  scenario  groups.  Let  G  denote  the  set  of  scenario  groups,  where  a  scenario 
group  g  is  a  tuple  of  a  set  of  scenarios  and  time  tig).  Also  let  P(Gt+i \Gt)  define  the  transition 
probability  between  scenario  groups.  In  MDP  notation,  the  scenario  tree  can  be  written  as  the 
tuple  (G,  A,  P(-,  •),  R(-,  •)),  where  A  is  a  singleton  set  as  we  assume  that  actions  do  not  affect 
the  transitions  in  the  scenario  tree  for  notational  simplicity.  Also,  we  let  the  rewards  R(g,  a)  =  0 
since  they  have  no  role  in  a  scenario  tree. 

Likewise,  we  can  write  each  scenario- specific  MDP  as  (So,  A0,  P0(-,  •),  Ro(-,  •)),  where,  for 
simplicity  we  assume  that  all  scenario-specific  MDPs  share  the  same  set  of  states  So,  actions 
Aq,  transition  probabilities  P0(st+i|st,  <h),  and  rewards  Ro(st,  at)5.  Denoting  Gf  to  be  the  ran¬ 
dom  variable  representing  a  scenario  group  at  time  t,  we  can  write  the  collapsed  MDP  as 
(S' ,  A' ,P' (-,-),  R' (-,-))  where 


S' 

A' 


P'(st+i  |  s*,  at) 

R'(s ,  a) 


{s  =  (g(s),s0(s))\g(s)  e  G,s0(s)  G  S0}, 

Ao, 

Po(s0(st+i)\sQ(st),at)P(g(st+i)\g(st)), 

R0(s0(s),a). 


As  seen  in  the  expression  for  the  Lagrangian  relaxation  (4.2.25),  constraints  in  the  stochastic 
program  are  simply  the  original  constraints  replicated  for  each  scenario.  Hence,  to  compute  the 
resource  usage  under  each  scenario,  JT  D^x^,  we  simply  need  to  read  off  =  P^(Ait,  Slt) 
from  the  collapsed  MDP. 

5For  our  transformation  method,  at  least  the  MDPs  for  the  scenarios  in  the  same  scenario  group  at  time  t  must 
share  the  same  states,  actions,  transition  probabilities,  and  rewards  at  time  t. 
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Scenario  1 

Scenario  2 

Scenario  3 

(a)  Original  MDPs 


(b)  Combined  MDP 


Figure  4.4:  Transformation  of  (a)  separate,  scenario-specific  MDPs  into  (b)  one  MDP.  “ws”  refers 
to  the  world  state  (determined  by  the  coin  flip  at  each  time  step),  “as”  the  agent’s  state  (e.g.  the 
inventory  level),  and  in  the  combined  MDP  “sg”  denotes  the  set  of  scenarios  the  world  could 
belong  to  given  the  state. 
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Thus  the  resulting  problem  is  another,  larger,  factored  MDP,  which  may  be  solved  via  the  same 
optimization  algorithms  as  discussed  in  Section  3.3. 


Execution 

Given  a  solution  from  optimizing  stochastic  factored  MDP,  we  can  execute  the  plan  by  sampling 
the  policies  as  described  for  the  deterministic  program  in  Section  3.2.3.  Additionally,  the  transition 
from  a  state  given  the  sampled  action  is  determined  by  the  world  state  for  the  next  time  step. 
However,  as  the  set  of  scenarios  in  the  stochastic  MDP  may  not  cover  the  entire  space  of  scenarios, 
we  may  run  into  a  state  with  no  valid  transitions  (i.e.,  we  run  out  of  scenarios).  In  such  cases,  we 
transition  to  a  state  matching  the  world  and  agent  state  but  associated  with  another  set  of  scenarios. 
Even  though  the  new  state’s  past  (sequence  of  world  states)  may  not  match  that  of  the  original  path, 
it  is  a  good  approximation,  since  the  states  are  Markovian;  if  the  sampled  scenarios  represented 
the  distribution  of  world  states  exactly,  over  infinite  time  steps,  every  state  with  the  same  world 
state  and  agent  state  values  would  in  fact  be  equivalent,  regardless  of  the  time  step.  However, 
while  this  heuristic  seems  to  work  well,  it  would  be  interesting  to  explore  different  possibilities 
for  dealing  with  out-of-sample  execution  paths.  For  example,  “featurizing”  the  scenario  space 
so  that  the  policies  depend  only  on  some  features  of  the  scenario  could  allow  us  to  completely 
sample  the  feature  space  and  never  encounter  an  unplanned-for  outcome,  which  would  in  fact 
eliminate  the  need  for  a  heuristic. 


4.3  Experiments 


We  demonstrate  the  improvement  in  efficiency  achieved  by  stochastic  programming  over  working 
with  deterministic  plans  (such  as  shown  in  Section  4.1.1)  on  a  set  of  simple  supply  chains.  In 
particular,  we  focus  on  linear  supply  chains  to  study  the  impact  of  stochastic  programming  on 
the  bullwhip  effect,  since  the  observation  of  bullwhip  effect  is  independent  of  the  width  of  the 
supply  chain.  In  each  case  we  assume  two  world  states:  recession  and  boom.  Thus,  each  scenario 
is  a  sequence  of  world  states,  sampled  assuming  that  the  world  state  may  change  with  a  small 
probability  at  each  time  step  (we  use  0.1). 

We  employ  causal  entropy  augmentation  (Section  3.4.1)  for  faster  optimization  using  L-BFGS6. 
As  a  solution  to  the  Fagrangian  relaxation,  even  with  near- zero  hinge  loss  value,  only  satisfies  the 
original  constraints  of  the  MIP  in  expectation ,  we  built  a  simple  reactive  planner  to  be  used  at 
execution  time  to  ensure  that  the  sell  and  buy  amounts  match  at  every  edge  in  the  supply  chain. 
Given  sampled  plans  per  the  execution  method  in  Section  4.2.1,  the  reactive  planner  simply  tries 
to  satisfy  the  demand  at  each  edge  if  possible  (i.e.,  sell  as  much  as  possible).  The  same  reactive 
planning  is  used  at  execution  of  plans  from  non-stochastic  planning  (such  as  those  used  to  create 
the  bullwhip  example  in  Section  4.1.1). 

6 While  the  objective  is  not  twice  differentiable  due  to  the  hinge  loss  and  thus  L-BFGS  may  not  theoretically 
converge  on  the  problem,  in  our  experiments  convergence  was  achieved  with  little  tweaking  of  the  parameters  (the 
hinge  loss  slopes  ay  and  the  causal  entropy  weight  8).  Here  we  use  8  =  100  and  a:i  values  in  the  range  of  200-1000. 
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Figures  4.2-4 .4  show  the  improvement  percentage  in  total  reward  (revenue  —  production  cost) 
of  stochastic  planners  (using  10,  20,  and  50  scenarios)  over  the  deterministic  planner,  for  supply 
chains  of  lengths  ranging  from  2  to  5. 7  Tables  4.2  and  4.3  give  the  results  on  the  “training  set” 
of  scenarios,  i.e.,  the  scenarios  that  appear  in  the  sampled  set  for  each  stochastic  planner.  They 
serve  as  a  sanity  check;  we  expect  the  stochastic  planner  to  generally  outperform  the  deterministic 
planner  since  the  stochastic  planner  plans  exactly  for  the  set  of  scenarios  given.  Table  4.3  shows 
numbers  for  scenarios  where  the  world  state  changes  at  least  once.  As  expected,  the  stochastic 
planners  outperform  the  deterministic  planner.  Table  4.2  on  the  other  hand  gives  the  results  on 
the  scenario  where  the  world  state  stays  in  recession.  The  deterministic  planner  should  give 
the  optimal  plan  for  such  a  case;  however,  since  plans  sometimes  may  not  satisfy  the  shared 
constraints,  it  is  possible  for  a  stochastic  planner  to  do  better  than  the  deterministic  planner,  once 
the  reactive  planner  has  corrected  for  the  broken  constraints.  In  general,  we  see  that  all  planners 
perform  comparably  on  the  unchanging-economy  scenario,  suggesting  that  the  stochastic  planners 
lose  little  on  optimalizing  for  the  unchanging  scenario  (the  best  case  for  the  deterministic  planner) 
while  improving  performance  on  scenarios  containing  changes  in  world  state. 

Table  4.4  shows  the  rewards  on  the  “test  set”,  i.e.,  scenarios  sampled  according  to  the  distirbution, 
excluding  the  unchanging-economy  scenario.  Here  we  evaluated  each  planner  on  10,000  scenarios. 
The  results  again  show  that  the  stochastic  planners  outperform  the  deterministic  planner  on 
changing  scenarios.  It  is  also  worth  noting  that  the  stochastic  planners  give  little  improvement 
upon  the  deterministic  planner  on  the  2-agent  chain  where  we  expect  the  bullwhip  effect  to  be 
least  problematic,  which  suggests  that  the  stochastic  planners  achieve  their  gains  by  ameliorating 
the  bullwhip  effect. 


4.4  Conclusion 

Planning  under  uncertainty  often  presents  to  be  a  non-trivial  challenge  to  auction-based  distributed 
planning  frameworks.  In  this  chapter,  we  introduced  a  natural  extension  to  our  distributed  multi¬ 
agent  planning  framework  using  stochastic  programming.  In  particular,  we  provided  a  method 
for  compactly  representing  uncertainty  in  factored  MDPs  while  maintaining  the  structure  that 
allows  us  to  employ  the  same  set  of  optimization  methods  used  for  non-scenario-based  factored 
MDPs.  Using  supply  chain  management  as  an  example  domain,  we  demonstrated  that  planning 
under  uncertainty  via  our  method  can  in  fact  provide  better  plans  in  a  changing  environment, 
while  sacrificing  little  in  solution  quality  for  when  the  environment  stays  static.  While  our 
experiments  cover  a  limited  set  of  supply  chains,  we  showed  that  an  interesting  phenomenon  in 
economics,  the  bullwhip  effect,  can  be  ameliorated  by  using  a  relatively  small  set  of  scenarios.  In 
the  future,  it  would  be  beneficial  to  investigate  how  we  can  scale  our  method  to  larger  problems 
and  correspondingly  larger  sets  of  scenarios.  It  would  also  be  interesting  to  provide  statistical 
guarantees  on  the  solution  quality  in  terms  of  the  number  of  samples  used.  Finally,  we  believe 
incorporating  communication  at  run-time  may  help  agents  adapt  more  quickly  and  efficiently  in  a 
changing  environment. 

7  Sign  test  is  significant  for  improvement  of  the  stochastic  planner  over  the  deterministic  planner  for  all  tables. 
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#agents 

10  scenarios 

20  scenarios 

50  scenarios 

2 

20 

20 

1 

3 

20 

1 

1 

4 

20 

1 

1 

5 

20 

1 

1 

Table  4. 1 :  Number  of  scenarios  trees  trained  and  tested. 


#agents 

10  scenarios 

20  scenarios 

50  scenarios 

2 

0.42% 

0.67% 

1.38% 

3 

-0.18% 

-0.01% 

0.17% 

4 

1.71% 

0.74% 

0.67% 

5 

1.15% 

1.13% 

1.09% 

Table  4.2:  Improvement  of  the  stochastic  planners  with  different  size  scenario  sets  over  the 
deterministic  planner,  on  the  scenario  with  the  economy  unchanging  from  recession.  The  results 
are  averaged  over  the  samples  in  “training”  and  “test”  sets,  and  look  very  similar  to  those  for  each 
set. 


#agents 

10  scenarios 

20  scenarios 

50  scenarios 

2 

0.04% 

-0.37% 

3.65% 

3 

2.95% 

6.21% 

2.41% 

4 

4.11% 

6.85% 

4.78% 

5 

4.66% 

4.97% 

6.84% 

Table  4.3:  Improvement  of  the  stochastic  planners  with  different  size  scenario  sets  over  the 
deterministic  planner,  on  the  “training  set”  of  scenarios  with  changing  economies. 


#agents 

10  scenarios 

20  scenarios 

50  scenarios 

2 

0.31% 

-0.12% 

10.53% 

3 

2.66% 

6.59% 

2.54% 

4 

7.27% 

6.43% 

4.22% 

5 

7.16% 

4.89% 

6.58% 

Table  4.4:  Improvement  of  the  stochastic  planners  with  different  size  scenario  sets  over  the 
deterministic  planner,  on  the  “test  set”  of  scenarios  with  changing  economies. 
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Chapter  5 
Related  Work 

5.1  Market-based  Distributed  Multi- Agent  Planning 


Various  forms  of  market-based  distributed  planning  have  been  studied  for  collaborative  domains. 
The  contract  net  protocol  [76]  gave  an  early  general  market-based  framework  for  formulating 
collaborative  planning  problems.  More  specific  work  has  subsequently  followed,  in  particular 
by  Dias  et  al.  [21],  Gerkey  [27],  and  Koenig  et  al.  [50]  for  task  allocation.  A  common  feature  in 
market-based  algorithms  is  the  trusted  auctioneer  that  assigns  tasks  based  on  agents’  bids,  which 
may  be  used  alone  (e.g.  [48],  [49],  [81]),  or  in  conjunction  with  agent-to-agent  negotiation  to 
improve  the  solution  quality  after  the  central  initial  allocation  (e.g.  [95],  [20],  [53]). 

Otherwise,  most  existing  market-based  algorithms  can  be  classified  into  two  categories:  those 
based  on  single-item  auctions  and  those  based  on  combinatorial  auctions.  In  a  single-item  auction, 
each  agent  bids  for  each  item  separately,  and  the  items  are  allocated  independently  of  each  other 
to  the  agents.  Most  work  has  been  in  this  setting,  including  recent  papers  that  have  presented 
algorithms  with  approximation  guarantees  on  the  cost  of  the  global  solution,  namely  those  by 
Lagoudakis  et  al.  [52]  for  sequential  single-task  auctions,  by  Koenig  et  al.  [49]  for  sequential 
single-task  auctions  with  regret  clearing,  by  Zheng  et  al.  [96]  for  an  improved  version  of  sequential 
single-task  auctions  using  hill  climbing,  and  by  Gerkey  [27]  for  one-task-per-robot  domains.  Most 
methods  employ  a  greedy  algorithm  for  the  auctioneer  to  allocate  tasks  (i.e.,  to  the  highest  bidder), 
and  thus  are  extremely  fast. 

However,  complex  constraints  often  found  in  more  realistic  problems  such  as  those  for  collision 
avoidance  or  task  deadlines  cannot  be  considered  naturally  in  the  single-item  auction  framework 
as  the  constraints  will  force  the  allocations  of  some  tasks  to  depend  on  others’.  Combinatorial 
auctions  [18]  allow  for  such  constraints  by  letting  agents  bid  on  bundles  of  resources,  and  have 
been  studied  in  the  collaborative  setting  for  robot  exploration  [10],  role  assignment  [38],  and 
task  allocation  [57],  among  others.  However,  both  the  bidders’  problem  (what  to  bid)  and  the 
auctioneer’s  problem  (assigning  items  to  bidders)  become  intractable  in  combinatorial  auctions. 
Hence,  actual  implementations  often  resort  to  heuristics  (e.g.,  handcrafted  bidding  rules),  and 
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therefore  to  suboptimal  solutions.  Therefore,  for  settings  without  such  coupled  constraints, 
fast  single-item  auction  algorithms  may  be  preferable,  especially  if  real-time  performance  is 
expected. 

Like  combinatorial  auctions,  our  AMDMA  algorithms  naturally  handle  constraints  coupling 
resources.  However,  unlike  most  work  using  combinatorial  auction  in  collaborative  domains,  our 
algorithms  use  iterative  bidding  on  small  sets  of  bundles  and  price  feedback  from  the  auctioneer, 
which  allows  the  agents  to  rebid  in  subsequent  iterations  toward  a  globally  optimal  solution.  Hence 
the  agents  may  not  need  to  communicate  their  preferences  on  exponentially  many  bundles  as  may 
be  the  requirement  under  combinatorial  auctions  in  some  settings.  Our  mechanism  is  most  similar 
to  iterative  combinatorial  auctions  [62],  which  have  been  mainly  studied  for  non-collaborative 
settings. 

We  note  that  many  auction  solutions  (e.g.  combinatorial  auctions)  are  mainly  studied  in  mechanism 
design  [24],  even  if  they  have  been  used  in  collaborative  settings.  The  goal  of  mechanism  design  is 
to  provide  incentives  to  potentially  self-interested  agents  in  order  to  achieve  a  global  objective  (e.g. 
social  optimum).  In  contrast,  we,  like  most  market-based  planners,  assume  collaborative  agents, 
and  focus  on  obtaining  the  optimal  solution  without  explicitly  trying  to  provide  incentives  for 
the  agents  to  follow  the  solution;  the  market  abstraction  provides  a  convenient  way  to  formulate 
distributed  multi-agent  planning  in  the  presence  of  shared  resources.  Hence  while  incentive¬ 
providing  auctions  are  more  robust  to  different  types  of  agents,  market-based  planners  are  able  to 
achieve  potentially  better  global  solutions  in  collaborative  settings. 

We  again  stress  that  our  methods  automatically  provide  resource  prices  without  domain  knowledge, 
and  potentially  to  optimality,  whereas  most  studies  under  either  auction  paradigm  engineer  markets 
and  resources  specifically  for  each  problem  domain,  and  often  without  theoretical  guarantees. 
AMDMAs  may  be  combined  with  problem-specific  approaches  to  yield  additional  resources 
that  are  more  specialized  but  principled  and  easier  to  design.  In  Price-and-Cut,  such  additional 
resources  are  based  on  bundles  of  resources,  created  to  compensate  for  inefficiencies  in  the  bundles 
currently  available  to  the  agents. 

Potentially  to  automate  resource  pricing,  machine  learning  techniques  have  been  applied  to  learn 
bidding  strategies.  The  most  relevant  work  is  by  Schneider  et  al.  [74],  who  use  the  notion  of 
opportunity  cost,  based  on  rewards  for  each  task,  to  learn  the  bidding  strategies  for  each  agent. 
However,  this  method  in  effect  still  requires  the  human  designer  to  price  resources,  or  tasks, 
through  rewards.  In  contrast,  learning  can  be  effective  for  automatic  pricing  in  scenarios  such 
as  oversubscribed  domains  where  not  all  tasks  are  expected  to  be  completed  but  a  penalty  is 
defined  for  unfinished  tasks  [40],  since  the  penalty  represents  a  natural  quantity  to  be  learned 
and  minimized.  Also,  learning  would  be  useful  if  uncertainty  existed  in  the  global  cost  function; 
however  this  setting  is  not  considered  by  the  learning-based  market-based  algorithms  mentioned 
above.  Hence,  in  the  mentioned  papers  learning  is  used  simply  as  a  substitute  for  optimization, 
which  may  be  sensible  if  efficiency  is  more  important  than  the  quality  of  the  solution,  as  it  may 
require  many  iterations  until  a  good  predictor  is  learned. 


52 


5.2  Decomposition-Based  Optimization  and  Planning 


Distributed  planning  algorithms  based  on  decomposition  methods  have  been  explored  in  the 
machine  learning  and  planning  communities.  The  most  relevant  works  include  that  of  Guestrin 
and  Gordon  [33]  for  hierarchically  factored  Markov  decision  processes  (MDPs),  of  Bererton  et 
al.  [9]  for  a  class  of  loosely  coupled  MDPs,  and  of  Calliess  et  al.  [14]  which  frames  the  market- 
based  interactions  as  a  game  between  adversarial  agents  computing  resource  prices  and  learning 
agents  that  represent  the  agents  in  the  original  problem.  However,  existing  work,  including  the 
aforementioned,  can  be  seen  as  applying  a  decomposition  for  linear  or  convex  programs,  which 
limits  them  to  infinitely  divisible  resources  only  (although  the  infinitely  divisible  solution  can 
often  be  a  reasonable  approximation  even  in  the  presence  of  discrete  resources). 

General  frameworks  for  Dantzig-Wolfe  decomposition  for  mixed  integer  programming  have 
been  explored,  particularly  in  the  operations  research  community  (e.g.,  [82,  84]).  Known  as 
branch- and-price  or  branch-and-price-and-cut,  these  frameworks  typically  focus  on  sequential 
or  tightly-coupled  parallel  execution.  They  use  branch-and-bound  for  solving  integer  programs, 
and  at  each  node  of  the  search  tree,  employ  Dantzig-Wolfe  decomposition  (and  in  some  cases 
cutting  planes)  to  solve  a  linear  program  relaxation  and  obtain  bounds.  (See  [6,  Ch.  6.7]  for 
a  good  overview.)  If  we  added  branching  to  our  algorithm,  it  would  fit  nicely  into  this  line  of 
research;  however,  it  is  not  clear  how  to  do  so  efficiently  and  robustly  in  a  distributed  framework. 
Much  attention  has  been  drawn  to  the  implementation  details  of  branch-and-price  algorithms 
[83]  which  would  need  to  be  resolved  in  the  distributed  setting;  in  particular,  keeping  track  of 
the  branch  tree  can  be  tricky,  and  it  is  an  art  to  find  good  branching  strategies.1  In  contrast,  our 
Price-and-Cut  algorithm  is  simple  to  implement  and  intuitively  distributed,  and  we  demonstrate 
distributed  operation  over  simple  socket-based  communication  links.  In  addition,  cuts  considered 
in  branch-and-price-and-cut  papers  are  often  problem-specific,  most  prevalently  for  the  vehicle 
routing  problem  with  time  windows  [64]. 

Thus  our  main  contribution  with  respect  to  distributed  MIP  optimization  is  twofold:  first,  removing 
branch-and-bound  from  the  combination  leads  to  an  algorithm  that  is  much  more  naturally 
distributed,  without  much  loss  in  efficiency  as  we  will  see  in  our  experiments.  Second,  previous 
work  has  not  applied  such  algorithms  to  distributed  multi-agent  planning;  we  draw  the  connection 
to  market-based  planning,  where  our  algorithm  provides  a  principled  way  to  introduce  new 
resources  and  set  prices  of  resources  (in  contrast  to  previous  heuristic  methods). 

While  not  based  on  a  decomposition,  distributed  constraint  optimization  (DCOP)  is  a  popular 
distributed  planning  method  that  frames  planning  problems  as  constraint  satisfaction  problems 
and  solve  them  in  a  fully-distributed  fashion  by  having  disjoint  sets  of  variables  assigned  to  each 
agent  and  using  a  search  algorithm  to  find  an  assignment  on  which  all  agents  agree  [56,  92],  No 
definitive  definition  seems  to  exist  in  the  multi-agent  planning  literature  for  DCOP,  but  most 
consider  DCOP  to  be  defined  for  discrete  variables,  to  assume  that  each  constraint  only  involves 
two  agents,  and  to  not  allow  hard  constraints.  On  the  other  hand,  distributed  constraint  satisfaction 

'One  possibility  is  to  extend  the  techniques  used  to  maintain  search  trees  in  distributed  constraint  optimization, 
although  DCOP  are  much  simpler  problems  than  general  MILPs. 
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problems  (DisCSP)  [94]  allow  hard  constraints  but  do  not  optimize  a  cost  function.  Hence  both 
DCOP  and  DisCSP  solve  simpler  problems  than  general  MILPs. 


5.3  Lagrangian  Relaxation  for  Multi- Agent  Planning 


Lagrangian  relaxation  is  used  heavily  in  large-scale  optimization  problems,  and  much  work  has 
been  focused  on  improving  the  quality  of  subgradient  descent  for  faster  convergence  [35,  43]. 
In  terms  of  applying  Lagrangian  relaxation  to  multi-agent  planning,  our  work  on  decomposing 
MILPs  (Chapter  2)  can  be  considered  a  form  of  Lagrangian  relaxation.  The  idea  of  leveraging 
bounded  influence  of  each  agent  on  resources  to  obtain  guarantees  for  Lagrangian  relaxation  in 
multi-agent  planning  was  introduced  by  Gordon  et  al.  [31].  They  give  an  approximation  guarantee 
for  the  subgradient  Lagrangian  relaxation  algorithm,  upon  which  we  base  our  near-optimality 
guarantees  for  the  optimization  methods  we  use  (Section  3.3.3). 

Guestrin  and  Gordon  [33]  give  a  distributed  planning  algorithm  for  factored  MDPs  [13],  a  closely 
related  formulation  to  our  MDP  framework  in  Chapters  3  &  4.  Whereas  we  apply  Lagrangian 
relaxation  to  a  scenario-based  approximation  of  the  joint  MDP  to  obtain  a  factored  problem,  they 
employ  an  approximation  to  the  Bellman  LP  of  a  factored  MDP  using  a  basis  and  subsequently 
apply  Lagrangian  relaxation  to  obtain  a  similarly  factored  problem.  Since  their  framework  is  based 
on  MDPs,  it  is  possible  to  plan  under  uncertainty  about  the  world  by  incorporating  a  scenario  tree 
into  the  transition  probabilities  as  we  do  in  Chapter  4.  However,  their  work  assumes  infinitely 
divisible  resources  and  thus  is  strictly  subsumed  by  our  formulation. 

The  trick  of  augmenting  the  objective  function  with  a  smooth  penalty  is  often  used  with  dual 
decomposition,  and  has  been  studied  at  least  since  Hestenes  [36,  37]  and  Powell  [65],  and 
subsequently  in  a  variety  of  convex  optimization  settings  [59,  60,  70,  71],  Such  methods  have 
seen  a  recent  surge  of  interest  in  machine  learning  and  optimization  communities  [16,  80,  86],  and 
in  some  cases,  optimizing  the  strength  of  the  strongly  convex  penalty  has  shown  to  be  an  effective 
extension  (e.g.  [16]).  Though  not  always,  augmentation  is  often  used  to  allow  optimization  with 
accelerated  gradient  methods  [16,  67,  68],  as  in  our  case  (Chapter  3). 


5.4  Multi-Agent  Planning  Under  Uncertainty 

Planning  under  uncertainty  in  market-based  distributed  multi-agent  planning  frameworks  is 
usually  handled  via  tight  coordination  between  agents  at  execution  (e.g.,  [42]),  where  little 
planning  is  done  with  respect  to  future  events2.  Such  run-time  coordination  could  be  beneficial 
in  our  framework,  which  currently  only  deals  with  coordination  and  communication  during  the 
deliberate  planning  stage.  Combining  the  two  (pre  and  during)  communication  approaches  would 
be  an  interesting  area  of  future  work. 

2  We  note,  however,  that  there  is  a  larger  body  of  work  on  planning  under  uncertainty  in  competitive  domains. 
Parkes  et  al.  [63]  give  a  good  overview  of  recent  work  on  incorporating  uncertainty  in  mechanism  design. 
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Stochastic  programming  and  scenario-based  optimization  [12]  have  been  applied  to  many  planning 
problems  including  supply  chain  management  [5,  73,  79].  While  supply  chain  management 
includes  a  wide  range  of  problems  from  facility  location  to  reverse  (return)  network  design,  many 
consider  problems  we  address  such  as  finding  optimal  contracts  between  agents  and  maintaining 
inventory  levels  under  demand  uncertainty  (see  [55,  77]  for  reviews).  Our  work  differs  from 
existing  work  in  two  main  areas.  First,  while  we  get  the  contract  prices  “for  free”  from  the 
resource  prices,  in  existing  work  they  are  explicitly  optimized  as  variables  in  the  problem, 
adding  to  the  design  and  optimization  effort  (see  e.g.  [4]).  Second,  stochastic  programs  in  most 
domains  are  often  solved  using  Benders  decomposition  (e.g.  [75]),  which  decomposes  the  problem 
into  subproblems;  however,  each  subproblem  corresponds  to  a  scenario,  in  comparison  to  our 
work  where  subproblems  correspond  to  agents3.  Hence,  while  both  decomposition  methods 
are  amenable  to  distributed  optimization,  ours  is  more  natural  in  a  multi-agent  setting,  as  the 
computation  is  assigned  to  the  entity  who  “owns”  the  subproblem,  which  results  in  far  less 
communication. 

There  has  also  been  interest  in  supply  chain  management  from  the  multi-agent  community  [88], 
which  puts  a  greater  focus  on  distributed  coordination.  Most  work  focuses  on  creating  incentives 
for  agents  to  coordinate,  from  optimizing  contracts  by  coordinating  at  local  levels  [1]  to  studying 
combinatorial  auctions  for  negotiating  contracts  in  competitive  settings  [90],  However,  similarly 
to  general  market-based  planning  frameworks,  uncertainty  is  rarely  a  main  concern,  and  thus  is 
usually  handled  by  coordination  at  execution,  without  deliberate  planning. 

Scenario-based  planning  under  uncertainty  has  also  been  studied  in  machine  learning,  in  particular 
in  the  context  of  reinforcement  learning.  One  notable  example  is  the  PEGASUS  algorithm 
proposed  by  Ng  et  al.  [93],  which  uses  sampling  (i.e.,  scenarios)  to  efficiently  plan  in  partially 
observable  Markov  decision  processes  (POMDPs).  A  particularly  relevant  line  of  research  stems 
from  work  by  Kearns  et  al.  on  planning  using  scenario  trees  for  single-agent  MDPs  [45]  and 
POMDPs  [46].  They  provide  efficient  planning  algorithms  for  (PO)MDPs  that  simulate  execution 
paths  from  the  (PO)MDP,  i.e.,  create  scenario  trees,  and  optimize  on  those  samples  instead  of  the 
entire  (PO)MDP.  Our  work  uses  a  similar  sampling  method  for  handling  uncertainty  in  MDPs, 
while  explicitly  dealing  with  coordination  in  multiagent  settings;  while  such  problems  can  be 
formulated  as  a  centralized  MDP,  our  use  of  factorization  and  decomposition  gives  us  a  naturally 
distributed  algorithm  with  smaller  problems  that  are  thus  more  efficient  to  optimize. 

A  general  framework  for  multiagent  planning  under  uncertainty  is  decentralized  POMDPs,  which 
extends  POMDPs  to  cooperative  multiagent  settings.  Many  variants  have  been  proposed  to 
increase  efficiency  by  approximation  or  by  imposing  further  assumptions  [8,  28,  34,  61,  91].  For 
example,  Spanan  et  al.  [78]  propose  a  method  that  plans  under  uncertainty  by  explicitly  modeling 
communication  as  part  of  the  optimization  problem,  where  communication  is  used  to  decrease 
both  uncertainty  in  other  agents’  actions  as  well  as  their  observations  which  may  be  unknown  to 
each  other.  Our  work  automatically  incorporates  communication  of  each  agent’s  intent  through 
resource  prices,  and  focuses  on  optimally  planning  under  uncertainty  in  observations  available 
globally  to  all  agents,  which  allows  us  to  use  scenario  trees  for  efficient  planning.  Whereas 
Spannan  et  al.  sequentially  optimize  each  agent  with  other  agents’  plans  fixed,  which  can  often 

3Our  methods  are  closely  related  to  Dantzig- Wolfe  decomposition,  which  is  the  dual  of  Bender’s  decomposition. 
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lead  to  suboptimality,  we  are  able  to  jointly  optimize  all  agents  using  our  automatic  pricing 
scheme. 


5.5  Supply  Chain  Management 

Many  aspects  of  supply  chain  management,  ranging  from  network  design  to  pricing  and  production 
optimization,  have  been  studied  in  operations  research  [55,  79].  The  most  relevant  line  of  work 
casts  the  pricing  and  production  problem  as  a  MILP  or  a  stochastic  program  [4,  5,  73,  77];  however, 
most  provide  a  centralized  solution,  and  the  use  of  decomposition  methods  are  limited  to  the 
usual  decomposition  of  stochastic  programs  by  scenarios  (e.g.  [75]),  which  is  not  amenable  to 
distributed  multi-agent  planning. 

Supply  chain  management  has  also  been  studied  from  a  multi-agent  perspective.  There  exists  a 
long  line  of  work  in  using  multi-agent  architectures  to  define  the  interactions  in  a  supply  chain, 
whether  with  automated  or  human  agents  (e.g.  [51]).  Much  effort  has  been  made  in  designing 
different  types  of  contracts  between  agents  to  obtain  an  efficient  supply  chain,  with  varying  levels 
of  focus  on  the  optimization  problem  at  each  agent.  Of  particular  interest  are  initiatives  to  study 
supply  chain  management  using  ideas  from  mechanism  design  and  auctions.  Notably,  Walsh 
and  Wellman  argued  that  the  multi-agent  community  is  in  a  particularly  good  position  to  solve 
supply  chain  management  problems  with  automated  agents  [87].  Much  of  the  follow  up  work 
has  involved  the  Supply  Chain  Trading  Agent  Competition  (TAC  SCM),  which  formulates  the 
supply  chain  management  as  a  game  between  self-interested  agents  competing  for  customers  and 
supplies  [3,  47,  89],  but  similar  ideas  have  also  been  explored  in  fields  ranging  from  management 
science  [15]  to  economics  [41].  While  many  multi-agent  based  settings  do  not  explicitly  consider 
uncertainty,  mechanism  design  research  is  an  exception;  an  interesting  case  study  is  that  of  Chick 
et  al.  [17]  on  influenza  vaccine  supply  chain  coordination  that  provides  a  cost-sharing  contract 
that  both  incentivizes  the  agents  as  well  as  achieves  global  optimization,  under  uncertainty  in  the 
efficacy  of  the  vaccines. 
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Chapter  6 

Conclusion  and  Future  Work 


In  this  thesis,  we  presented  a  general,  principled  framework  for  the  design  and  optimization  of  coor¬ 
dination  in  cooperative  market-based  multi-agent  planning  with  shared  constraints.  We  formalized 
multi-agent  planning  as  mixed  integer  programs,  and  gave  decomposition-based  distributed  plan¬ 
ning  algorithms  with  optimality  guarantees.  The  formalization  allowed  us  to  incoroporate  complex 
shared  constraints,  as  well  as  planning  under  uncertainty,  both  of  which  are  not  well-treated  in 
market-based  distributed  planning  literature.  Our  Automatically-pricing  Market-based  Distributed 
Multi-agent  Algorithms  (AMDMAs)  price  resources  automatically  and  optimally  given  only 
the  overall  objective  and  shared  constraints,  removing  the  often  time-consuming  efforts  used  in 
designing  a  market.  AMDMAs  allow  each  agent  to  solve  its  local  planning  problem  while  only 
communicating  its  intentions  regarding  shared  resources  to  other  agents,  and  the  communication 
protocol  is  defined  automatically  along  with  resource  pricing. 

We  studied  three  instances  of  AMDMA,  addressing  planning  by  exactly  solving  MILPs,  faster 
planning  via  relaxation,  and  planning  under  uncertainty: 

First,  in  Chapter  2  we  gave  a  novel  distributed  MILP  optimization  algorithm,  price-and-cut,  based 
on  Dantzig-Wolfe  decomposition  and  Gomory  cuts,  with  convergence  and  optimality  guarantees. 
Price-and-cut  can  be  seen  as  running  Dantzig-Wolfe  decomposition  on  the  LP  relaxation  of  the 
MILP,  while  adding  cuts  generated  by  the  Gomory  cutting-plane  method  to  further  and  further 
constrain  the  LP  until  we  find  a  (mixed)  integral  solution.  The  cuts  correspond  to  derivative 
resources  that  represent  bundles  of  existing  resources.  We  showed  price-and-cut’s  effectiveness  in 
distributed  planning  using  two  problem  domains:  1.  distributed  SAT  problems  as  an  abstraction 
for  planning  problems,  and  2.  an  unmanned  aerial  vehicle  task  allocation  and  path  planning  task. 
In  particular,  we  demonstrated  that  price-and-cut  can  improve  the  running  time  substantially  for 
large  problems,  in  comparison  to  CPLEX,  an  industry-standard  MILP  solver,  not  only  under 
distributed  execution,  but  even  when  it  is  run  sequentially. 

Next,  in  Chapter  3,  we  extended  our  framework  by  introducing  another  decomposition  method, 
Lagrangian  relaxation,  in  order  to  handle  larger  problems.  Programs  resulting  from  Langrangian 
relaxation  can  be  optimized  much  more  efficiently  using  convex  optimization  methods.  While 
Lagrangian  relaxation  does  not  in  general  come  with  solution  quality  guarantees,  for  an  impor- 
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tant  set  of  problems,  where  many  agents  must  share  the  resources,  our  Lagrangian  relaxation 
method  is  shown  to  give  near-optimal  solutions.  The  method  comprises  three  parts.  First,  we 
introduced  Langrangian  relaxation,  which  dualizes  the  shared  constraints.  Lagrangian  relaxation 
decomposes  the  original  MILP  into  subproblems,  much  like  price-and-cut.  However,  it  gives  us  a 
convex  optimization  problem  and  introduces  an  approximation,  as  it  in  effect  relaxes  the  shared 
constraints  to  be  satisfied  in  expectation.  Second,  we  gave  two  methods,  subgradient  projection 
and  an  accelerated  gradient  method,  FISTA,  to  optimize  the  Lagrangian  relaxation,  including  an 
augmentation  scheme  to  allow  the  use  of  FISTA  on  linear  programs.  In  particular,  we  proposed 
the  use  of  maximum  causal  entropy  penalty  for  MDPs,  which  enables  us  to  retain  fast  subproblem 
solvers.  And  finally,  we  gave  a  simple  but  effective  rounding  scheme  for  obtaining  a  feasible 
solution  to  the  original  MILP  given  a  solution  to  the  Lagrangian  relaxation.  For  problems  with 
a  large  number  of  agents  where  each  agent’s  influence  on  resource  consumption  is  limited,  our 
algorithms  combined  with  the  rounding  scheme  are  shown  to  given  near-optimal  solutions.  On  a 
difficult  traffic  management  problem,  we  showed  that  the  use  of  the  accelerated  gradient  method 
dramatically  improves  the  convergence  of  optimization,  at  little  loss  to  the  quality  of  the  solution, 
even  though  the  use  of  augmentation  leads  to  optimizing  a  perturbed  objective  function. 

Finally,  in  Chapter  4,  we  examined  distributed  planning  under  uncertainty  by  further  extending 
our  Lagrangian  relaxation  framework  using  multi-stage  stochastic  programming.  In  particular, 
we  gave  an  efficient  construction  of  factored  MDPs  under  uncertainty  based  on  sampling,  which 
is  in  fact  a  generalization  of  the  construction  of  the  factored  MDPs  in  the  deterministic  setting 
(Chapter  3).  We  studied  the  supply  chain  management  problem  as  an  example  domain  of  heavy 
dependencies  between  agents  and  non-trivial  uncertainty,  and  demonstrated  that  our  stochastic 
planning  algorithm  can  ameliorate  the  bullwhip  effect,  a  well-known  economic  phenomenon  of 
inefficiency  under  lack  of  planning  under  uncertainty. 


6.1  Future  Work 


While  we  have  laid  the  foundation  for  general,  principled  distributed  multi-agent  planning  with 
shared  constraints,  many  interesting  avenues  of  future  work  remain. 

While  AMDMAs  are  distributed  algorithms,  the  master  program  is  a  centralized  sequential 
computation,  whether  in  implementation  it  is  performed  on  one  server  or  on  multiple  agents 
simultaneously.  Distributing  the  master  program  among  the  agents  would  add  further  robustness 
to  failure  and  help  speed  up  computation.  Our  experiments  have  demonstrated  the  promise  of  the 
proposed  algorithms,  but  work  remains  to  make  them  more  scalable  for  larger,  more  complex 
problems  and  to  evaluate  them  in  additional  domains. 

Our  work  has  focused  on  deliberate  planning  before  execution.  However,  exploring  run-time 
coordination  via  communication  would  be  an  interesting  area  of  future  work.  In  principle,  run¬ 
time  communication  merely  requires  giving  the  agents  communication  actions  and  optimizing 
their  policies,  but  in  practice,  many  complications  are  likely  to  arise — e.g.,  how  expressive  a 
communication  language  can  we  give  the  agents  while  still  arriving  at  a  tractable  planning  problem 
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in  the  end?  In  addition,  it  would  be  interesting  to  investigate  whether  we  could  optimize  both  the 
deliberate  planner  and  the  reactive  (execution-time)  planner  jointly,  especially  knowing  what  the 
deficiencies  of  the  deliberate  planner  are,  such  as  its  relaxed  nature  (e.g.  Lagrangian  relaxation) 
or  its  limited  coverage  of  uncertainty  from  using  a  small  number  of  samples. 

With  respect  to  planning  under  uncertainty,  we  believe  an  immediately  impactful  work  would 
be  to  more  precisely  define  the  relationship  between  different  representations  for  planning  uncer 
uncertainty.  For  example,  we  gave  a  formulation  of  factored  MDPs  with  shared  constraints 
as  a  stochastic  program  in  Section  4.2.1.  Of  particular  relevance  is  the  precise  relationship 
between  factored  MDPs  with  shared  transition  functions  (e.g.  [32])  and  factored  MDPs  with 
shared  constraints  (Section  3.2.4).  We  believe  the  former  can  be  translated  to  the  latter  in  order 
to  take  advantage  of  our  distributed  algorithm.  However,  understanding  the  precise  relationship 
as  well  as  each  formulation’s  representational  power  will  allow  us  to  analyze  their  efficiency 
and  prescribe  a  “formula”  for  under  what  settings  each  formulation  is  more  practical.  Such  a 
characterization  will  help  researchers  and  practitioners  from  different  fields  including  operations 
research,  multi-agent  planning,  and  controls,  to  better  formulate,  construct  and  solve  multi-agent 
problems. 

Furthermore,  it  would  be  very  interesting  to  study  how  applying  our  uncertainty  handling  trick  to 
price-and-cut,  our  exact  MILP  solver,  would  interact  with  derivative  resources,  which  we  did  not 
study  in  Chapter  4.  Also,  it  may  be  beneficial  to  better  understand  the  results  of  scenario-based 
handling.  For  example,  uncovering  any  potential  patterns  in  the  resulting  plans  with  respect  to 
their  corresponding  scenarios  may  help  us  plan  even  more  efficiently. 

Finally,  it  would  be  very  interesting  to  extend  our  work  to  the  non-collaborative  setting  where 
agents  may  possess  interests  conflicting  with  the  global  objective  of  the  system  or  the  socially 
optimal  joint  plans.  Such  situations  arise  in  many  domains;  for  example,  in  supply  chain  manage¬ 
ment,  it  is  extremely  natural  for  each  agent  to  try  to  maximize  its  profit  without  regard  to  others’ 
profits.  In  fact,  an  agent  may  be  directly  competing  with  another  agent  for  sales.  Also,  even  when 
agents  are  cooperative,  they  may  not  want  to  reveal  everything  about  their  intentions. 

Planning  for  collaborative  agents  can  be  seen  as  a  global  optimization  problem,  hence  the  solutions 
we  have  described  so  far  are  general-purpose  distributed  solvers  for  MILPs.  However,  when 
planning  for  self-interested  agents,  we  desire  a  stronger  guarantee:  that  the  agents,  given  our 
market  mechanism,  will  act  to  converge  to  a  socially  optimal  joint  plan.  One  formalization  of  a 
social  optimum  is  a  game-theoretic  equilibrium:  if  agents  know  that  following  a  certain  strategy 
will  lead  to  an  equilibrium,  agents  have  no  incentive  to  change  their  strategy.  Another,  weaker 
formulation  is  a  boundedly-rational  equilibrium:  agents  can  not  discover  an  incentive  to  change 
their  strategy  using  some  limited  type  of  computation.  The  latter  formulation  may  lead  to  a 
smaller  price  of  anarchy,  and  thus  may  be  more  appropriate  for  us,  since  our  goal  is  to  not  only 
incentivize  the  agents  with  a  guarantee  of  equilibria  but  also  seek  some  notion  of  social  optimum, 
or  a  globally  good  solution. 

A  possible  approach  would  be  to  exploit  duality  and  no-regret  learning  in  an  analogous  fashion 
to  [14],  who  addressed  the  simpler  problem  of  divisible  resources,  leading  to  a  convex  optimization 
problem.  As  in  [14],  one  could  extend  the  master  LP  computation  in  price-and-cut  by  assigning 
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subsets  of  resources  to  the  agents  to  be  priced,  and  using  no-regret  learning  for  every  agent  to 
solve  their  subproblems  as  well  as  resource  pricing.  The  hope  would  be  to  establish  similar 
guarantees  to  [14],  which  gives  asymptotic  guarantees  on  the  cost  of  the  average  plan  (Theorem 
3.4,  [14])  and  its  convergence  to  a  Nash  equilibrium  (Theorem  4.1,  [14]). 

Certainly,  non-convexity  comes  with  an  additional  set  of  challenges.  The  main  challenge  will 
be  to  generate  cuts  in  a  distributed  manner  in  addition  to  solving  the  master  LP  in  a  distributed 
fashion.  To  do  so,  one  must  ensure  that  agents  are  given  an  incentive  to  only  generate  valid 
and  useful  cuts.  Also,  for  completeness,  theoretically  one  needs  be  able  to  generate  derivative 
resources  from  arbitrary  combinations  of  resources,  which  may  require  communication  between 
every  pair  of  resource-pricing  agents.  To  reduce  communication,  it  may  be  possible  to  perform 
further  decomposition  if  such  a  structure  is  present  in  the  problem,  or  we  can  explore  strategies 
for  finding  good  cuts  while  minimizing  communication,  which  could  be  implemented  as  agents 
discovering  agents  with  whom  to  communicate. 
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